In [24]:
import pandas as pd
from sklearn.model_selection import train_test_split

---

## Step 1: Generate Code Quality Scores Using `Flake8`

Since our dataset now includes the actual code for each function, we can use `Flake8` to objectively assess code quality.

### Why use `Flake8`?
- It's a **widely-used Python linter** that detects code smells, complexity, unused variables, and more.
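- Its violation reports can be condensed into a **numeric score out of 10** summarizing the overall code quality (Flake8 itself lists individual violations; the score is derived from them).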
- This gives us an **automated, data-driven way** to assign quality scores instead of relying on hand-crafted heuristics.

### What we’ll do:
- Write each function’s code to a temporary Python file.
- Run `flake8` on that file.
- Parse the output and convert it into a numeric score.
- Store the score in a new column called `quality_score`.
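
The steps above can be sketched roughly as follows. This is only an illustration, not the contents of `Score_quality.py`: the helper names and the penalty-per-violation formula are assumptions, since Flake8 reports individual violations rather than a ready-made score.

```python
import os
import subprocess
import sys
import tempfile


def count_flake8_violations(code: str) -> int:
    """Write `code` to a temporary file, run flake8 on it, and count reported lines."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".py", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write(code)
        path = tmp.name
    try:
        result = subprocess.run(
            [sys.executable, "-m", "flake8", path],
            capture_output=True,
            text=True,
        )
    finally:
        os.remove(path)
    # flake8 prints one line per violation: path:row:col: CODE message
    return sum(1 for line in result.stdout.splitlines() if line.strip())


def violations_to_score(n_violations: int, penalty: float = 1.0) -> float:
    """Hypothetical mapping (an assumption): start at 10, subtract a penalty per violation."""
    return max(0.0, 10.0 - penalty * n_violations)
```

Applied per row, something like `violations_to_score(count_flake8_violations(snippet))` could produce the value stored in `quality_score`. Note that this sketch counts zero violations if `flake8` is not installed, so a real script should check for that case.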

### Note:
- This step has already been done in a separate script, `Score_quality.py`.

---


In [25]:
df = pd.read_csv('../data/interim/function_features_with_scores.csv')



In [26]:
df.isnull().sum()

name                            15
node_type                       12
file_path                       12
code_snippet                    12
repo_name                      167
repo_stars                     356
repo_forks                     411
repo_watchers                  446
repo_language                  458
repo_created_at                465
repo_last_updated              472
repo_topics                    474
loc                            475
num_args                       475
num_returns                    478
num_variables                  478
num_function_calls             480
has_decorators                 505
uses_globals                   490
is_recursive                   482
estimated_branches          272618
estimated_difficulty           479
estimated_bugs                 481
has_docstring                  480
docstring_length               481
num_comments                   482
name_length                    484
is_name_well_formed            509
bad_variable_names_c

In [27]:
columns_to_drop = [
    "name",
    "node_type",
    "file_path",
    "repo_name",
    "repo_stars",
    "repo_forks",
    "repo_watchers",
    "repo_language",
    "repo_created_at",
    "repo_last_updated",
    "repo_topics",
    "estimated_branches",  # all values null
    "quality"              # all values null will add it later when the model is finished
]

In [28]:
df = df.drop(columns=columns_to_drop)

In [29]:
df.dtypes

code_snippet                 object
loc                          object
num_args                     object
num_returns                  object
num_variables                object
num_function_calls           object
has_decorators               object
uses_globals                 object
is_recursive                 object
estimated_difficulty         object
estimated_bugs               object
has_docstring                object
docstring_length             object
num_comments                 object
name_length                  object
is_name_well_formed          object
bad_variable_names_count     object
max_return_length            object
estimated_complexity         object
quality_score               float64
dtype: object

In [30]:
df.isnull().sum()

code_snippet                 12
loc                         475
num_args                    475
num_returns                 478
num_variables               478
num_function_calls          480
has_decorators              505
uses_globals                490
is_recursive                482
estimated_difficulty        479
estimated_bugs              481
has_docstring               480
docstring_length            481
num_comments                482
name_length                 484
is_name_well_formed         509
bad_variable_names_count    496
max_return_length           515
estimated_complexity        526
quality_score                24
dtype: int64

In [31]:
duplicates = df.duplicated()
df[duplicates]

Unnamed: 0,code_snippet,loc,num_args,num_returns,num_variables,num_function_calls,has_decorators,uses_globals,is_recursive,estimated_difficulty,estimated_bugs,has_docstring,docstring_length,num_comments,name_length,is_name_well_formed,bad_variable_names_count,max_return_length,estimated_complexity,quality_score
45,"def __eq__(self, other):\n return all(\...",7,2,1,0,3,FALSE,FALSE,FALSE,0.5,0.004643856,FALSE,0,0,6,TRUE,0,107.0,1.0,8.0
46,"def __ne__(self, other):\n return not s...",2,2,1,0,0,FALSE,FALSE,FALSE,1,0.00386988,FALSE,0,0,6,TRUE,0,17.0,1.0,8.0
163,def __enter__(self):\n return self,2,1,1,0,0,FALSE,FALSE,FALSE,0,0,FALSE,0,0,9,TRUE,0,4.0,1.0,8.0
164,"def __exit__(self, *args):\n self.close()",2,1,0,0,1,FALSE,FALSE,FALSE,0,0,FALSE,0,0,8,TRUE,0,0.0,1.0,8.0
525,def response_handler(sock):\n consu...,7,1,0,0,2,FALSE,FALSE,FALSE,0,0,FALSE,0,0,16,TRUE,0,0.0,1.0,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
272489,"def is_base_ty_like(self, base_ty: BaseTy) -> ...",2,2,1,0,1,FALSE,FALSE,FALSE,0,0,FALSE,0,0.0,15.0,True,0.0,34.0,1.0,7.0
272490,def is_symint_like(self) -> bool:\n ret...,2,1,1,0,1,FALSE,FALSE,FALSE,0,0,FALSE,0,0.0,14.0,True,0.0,26.0,1.0,8.0
272497,def alias_info(self) -> Annotation | None:\n ...,2,1,1,0,0,TRUE,FALSE,FALSE,0,0,FALSE,0,0.0,10.0,True,0.0,15.0,1.0,7.0
272499,def is_write(self) -> bool:\n return se...,2,1,1,0,0,TRUE,FALSE,FALSE,1,0.005169925,FALSE,0,0.0,8.0,True,0.0,56.0,2.0,8.0


In [32]:
df = df.drop_duplicates()

In [33]:
# Drop rows where quality_score is null
df = df[df['quality_score'].notnull()]

---

## Step 5: Bin Scores into Quality Labels

Once we have `quality_score`, we classify it into two discrete quality levels:
- **score ≤ 5** → `bad`
- **score > 5** → `good`

These categories are stored in a new column: `quality`.

This prepares our dataset for classification tasks, where the model will learn to predict the label from the features.

---

In [34]:
df['quality'] = pd.cut(
	df['quality_score'],
	bins=[-float('inf'), 5, float('inf')],
	labels=['bad', 'good']
)
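
As a quick check of how the cut above treats its edges: `pd.cut` closes intervals on the right by default, so a score of exactly 5 lands in `bad`. The scores below are made up for illustration.

```python
import pandas as pd

# Toy scores chosen to probe the boundary at 5 (not taken from the dataset)
scores = pd.Series([0.0, 5.0, 5.1, 10.0])

labels = pd.cut(
    scores,
    bins=[-float("inf"), 5, float("inf")],
    labels=["bad", "good"],
)

# Intervals are (-inf, 5] and (5, inf] because right=True is the default
print(labels.tolist())
```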


In [35]:
df = df[df['quality_score'].notna()]
if df.empty:
	raise ValueError("All quality scores are NaN. Check flake8 execution or input data.")

In [36]:
df.head()

Unnamed: 0,code_snippet,loc,num_args,num_returns,num_variables,num_function_calls,has_decorators,uses_globals,is_recursive,estimated_difficulty,...,has_docstring,docstring_length,num_comments,name_length,is_name_well_formed,bad_variable_names_count,max_return_length,estimated_complexity,quality_score,quality
0,"def check_compatibility(urllib3_version, chard...",33,3,0,7,15,False,False,False,2.857142857,...,False,0,6,19,True,0,0.0,10.0,5.0,bad
1,def _check_cryptography(cryptography_version):...,12,1,1,2,5,False,False,False,0.5,...,False,0,1,19,True,0,0.0,3.0,6.0,good
2,"def to_native_string(string, encoding=""ascii"")...",11,2,1,2,2,False,False,False,0.0,...,True,3,0,16,True,0,3.0,2.0,8.0,good
3,"def unicode_is_ascii(u_string):\n """"""Determ...",13,1,2,0,2,False,False,False,0.0,...,True,5,0,16,True,0,5.0,3.0,9.0,good
4,"def _urllib3_request_context(\n request: ""P...",45,4,1,18,8,False,False,False,2.8125,...,False,0,4,24,True,0,26.0,10.0,0.0,bad


In [37]:
df.dropna(inplace=True)

Before we drop duplicate rows, we need to infer the dtype of each column so that all actual duplicates are dropped. For example, `'2'` and `'2.0'` have the same value, but `drop_duplicates()` treats them as different strings and won't drop them. Converting each column to its proper dtype ensures they compare as equal.
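
A minimal sketch of the problem, using a made-up two-row frame (the column names are borrowed from the dataset, the values are not):

```python
import pandas as pd

# Toy frame: both rows describe the same values, but `loc` is stored as text
toy = pd.DataFrame({"loc": ["2", "2.0"], "num_args": ["1", "1"]})

# As strings, '2' and '2.0' differ, so no row is flagged as a duplicate
dupes_as_text = int(toy.duplicated().sum())

# After numeric conversion both become 2.0 and the second row is a duplicate
numeric = toy.apply(pd.to_numeric)
dupes_as_numbers = int(numeric.duplicated().sum())

print(dupes_as_text, dupes_as_numbers)
```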

In [38]:
df.count()

code_snippet                260202
loc                         260202
num_args                    260202
num_returns                 260202
num_variables               260202
num_function_calls          260202
has_decorators              260202
uses_globals                260202
is_recursive                260202
estimated_difficulty        260202
estimated_bugs              260202
has_docstring               260202
docstring_length            260202
num_comments                260202
name_length                 260202
is_name_well_formed         260202
bad_variable_names_count    260202
max_return_length           260202
estimated_complexity        260202
quality_score               260202
quality                     260202
dtype: int64

In [39]:
def coerce_types(df):
	for col in df.columns:
		if df[col].dtype == 'object':
			# First try numeric conversion
			try:
				df[col] = pd.to_numeric(df[col])
				continue
			except (ValueError, TypeError):
				pass
			
			# Handle mixed bools (strings + actual booleans)
			unique_vals = set(df[col].dropna().unique())
			bool_candidates = {'TRUE', 'FALSE', 'True', 'False', 'true', 'false', True, False}
			
			if unique_vals.issubset(bool_candidates):
				df[col] = df[col].apply(
					lambda x: x
						if isinstance(x, bool) 
						else str(x).lower() == 'true'
				)
	return df

df = coerce_types(df)

In [40]:
df.dtypes

code_snippet                  object
loc                          float64
num_args                     float64
num_returns                    int64
num_variables                  int64
num_function_calls             int64
has_decorators                  bool
uses_globals                    bool
is_recursive                    bool
estimated_difficulty         float64
estimated_bugs               float64
has_docstring                   bool
docstring_length               int64
num_comments                 float64
name_length                  float64
is_name_well_formed             bool
bad_variable_names_count     float64
max_return_length            float64
estimated_complexity         float64
quality_score                float64
quality                     category
dtype: object

In [41]:
df.drop_duplicates(inplace=True)

In [42]:
df.count()

code_snippet                258599
loc                         258599
num_args                    258599
num_returns                 258599
num_variables               258599
num_function_calls          258599
has_decorators              258599
uses_globals                258599
is_recursive                258599
estimated_difficulty        258599
estimated_bugs              258599
has_docstring               258599
docstring_length            258599
num_comments                258599
name_length                 258599
is_name_well_formed         258599
bad_variable_names_count    258599
max_return_length           258599
estimated_complexity        258599
quality_score               258599
quality                     258599
dtype: int64

## Step 6: Train-Test Split and Dataset Saving

Once the features are ready, we’ll split the dataset into **training** and **testing** sets.

### Why we do this:
- To **evaluate** model performance fairly.
- To **prevent data leakage** — the model should never "see" the test data during training.
- To **reuse** the same splits for all models and experiments.

### What we’ll do:
- Split 80% for training, 20% for testing using `train_test_split()`.
- Use `stratify=y` to preserve class proportions across splits.
- Save the resulting datasets (`X_train`, `X_test`, `y_train`, `y_test`) to the `data/processed/` folder so they can be easily loaded later in the training and evaluation notebooks.
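
To illustrate what `stratify=y` does, here is a small sketch on made-up data with an 80/20 class mix. The variable names are hypothetical and separate from the real split performed later in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 80 'good' rows and 20 'bad' rows (made up for illustration)
X_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.Series(["good"] * 80 + ["bad"] * 20, name="quality")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=42
)

# Class shares in each split should match the 80/20 mix of the full data
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```

Without `stratify`, a random split could leave the minority class over- or under-represented in the test set, which would skew evaluation metrics.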

---


In [43]:
# Sanity check: df has no duplicate rows before splitting (these duplicates have been driving me crazy)
assert not df.duplicated().any()

Make sure `pyarrow` is installed before running the cells below (it is needed for the Parquet export):
```
pip install pyarrow
```

In [44]:
# Separating the features and target 
X = df.drop(columns=['quality_score', 'quality'])
y = df['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Test df has no duplicates after train_test_split
assert not X_train.duplicated().any(), "Duplicates found in X_train"
assert not X_test.duplicated().any(), "Duplicates found in X_test"

In [45]:
X_train.duplicated().sum()

np.int64(0)

In [46]:
X_train.to_parquet("../data/processed/X_train.parquet", index=False)
X_test.to_parquet("../data/processed/X_test.parquet", index=False)
y_train.to_frame().to_parquet("../data/processed/y_train.parquet", index=False)
y_test.to_frame().to_parquet("../data/processed/y_test.parquet", index=False)