In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

---

## Step 1: Generate Code Quality Scores Using `Flake8`

Since our dataset now includes the actual code for each function, we can use `Flake8` to objectively assess code quality.

### Why use `Flake8`?
- It's a **widely used Python linter** that flags style violations, unused imports and variables, overly complex functions, and more.
- Flake8 itself reports individual violations rather than a score, so the **numeric score out of 10** used here is derived from its output.
- This gives us an **automated, data-driven way** to assign quality scores instead of relying on hand-crafted heuristics.

### What we’ll do:
- Write each function’s code to a temporary Python file.
- Run `flake8` on that file.
- Parse the output and convert the reported violations into a numeric score.
- Store the score in a new column called `quality_score`.

### Note:
- This step has already been run in a separate script, `Score_quality.py`.
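
The steps above can be sketched roughly as follows. This is not the content of `Score_quality.py`, which is not shown here. The helper names (`count_violations`, `score_from_violations`, `flake8_score`) and the scoring rule itself are assumptions made for illustration. The only Flake8 behaviour relied on is that it prints one line per violation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def count_violations(flake8_output: str) -> int:
    """Count non-empty report lines; flake8 prints one line per violation."""
    return sum(1 for line in flake8_output.splitlines() if line.strip())


def score_from_violations(n_violations: int, n_lines: int) -> float:
    """Hypothetical rule: start at 10 and subtract 10 * violations per line."""
    if n_lines <= 0:
        return 0.0
    return round(max(0.0, 10.0 - 10.0 * n_violations / n_lines), 1)


def flake8_score(code: str) -> float:
    """Write `code` to a temporary file, lint it with flake8, and score it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "snippet.py"
        path.write_text(code, encoding="utf-8")
        result = subprocess.run(
            [sys.executable, "-m", "flake8", str(path)],
            capture_output=True,
            text=True,
        )
    # A non-zero exit with no report usually means flake8 itself failed to run.
    if result.returncode != 0 and not result.stdout.strip():
        raise RuntimeError(f"flake8 did not run cleanly: {result.stderr.strip()}")
    return score_from_violations(
        count_violations(result.stdout), len(code.splitlines())
    )
```

Under these assumptions, a column could be filled with something like `df["code_snippet"].apply(flake8_score)`, although spawning one process per snippet would likely be slow on a dataset of this size.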

---


In [2]:
df = pd.read_csv('../data/interim/merged_scored_chunks.csv')

In [3]:
df.isnull().sum()

name                             6
node_type                        0
file_path                        0
code_snippet                     0
repo_name                        0
repo_stars                       0
repo_forks                       0
repo_watchers                    0
repo_language                    0
repo_created_at                  0
repo_last_updated                0
repo_topics                      0
loc                              0
num_args                         0
num_returns                      0
num_variables                    0
num_function_calls               0
has_decorators                   0
uses_globals                     0
is_recursive                     0
estimated_branches          900000
estimated_difficulty             0
estimated_bugs                   0
has_docstring                    0
docstring_length                 0
num_comments                     0
name_length                      0
is_name_well_formed              0
bad_variable_names_c

In [4]:
columns_to_drop = [
    "name",
    "node_type",
    "file_path",
    "repo_name",
    "repo_stars",
    "repo_forks",
    "repo_watchers",
    "repo_language",
    "repo_created_at",
    "repo_last_updated",
    "repo_topics",
    "estimated_branches",  # all values null
    "quality"              # all values null; will be added later once the model is finished
]

In [5]:
df = df.drop(columns=columns_to_drop)

In [6]:
df.dtypes

code_snippet                 object
loc                           int64
num_args                      int64
num_returns                   int64
num_variables                 int64
num_function_calls            int64
has_decorators                 bool
uses_globals                   bool
is_recursive                   bool
estimated_difficulty        float64
estimated_bugs              float64
has_docstring                  bool
docstring_length              int64
num_comments                  int64
name_length                   int64
is_name_well_formed            bool
bad_variable_names_count      int64
max_return_length             int64
estimated_complexity          int64
quality_score               float64
dtype: object

In [7]:
df.isnull().sum()

code_snippet                0
loc                         0
num_args                    0
num_returns                 0
num_variables               0
num_function_calls          0
has_decorators              0
uses_globals                0
is_recursive                0
estimated_difficulty        0
estimated_bugs              0
has_docstring               0
docstring_length            0
num_comments                0
name_length                 0
is_name_well_formed         0
bad_variable_names_count    0
max_return_length           0
estimated_complexity        0
quality_score               0
dtype: int64

In [8]:
duplicates = df.duplicated()
df[duplicates]

Unnamed: 0,code_snippet,loc,num_args,num_returns,num_variables,num_function_calls,has_decorators,uses_globals,is_recursive,estimated_difficulty,estimated_bugs,has_docstring,docstring_length,num_comments,name_length,is_name_well_formed,bad_variable_names_count,max_return_length,estimated_complexity,quality_score
147,"def button_released(self, e=None):\r\n ...",3,2,0,0,2,False,False,False,0.0,0.000000,False,0,0,15,True,0,0,1,8.0
149,"def focusin(self, e):\r\n self.selectio...",4,2,0,0,2,False,False,False,0.0,0.000000,False,0,0,7,True,0,0,2,7.0
406,def list_of_str(arg):\n return list(map(str...,2,1,1,0,3,False,False,False,0.0,0.000000,False,0,0,11,True,0,30,1,9.0
450,"def computeIoU(bbox1, bbox2):\n x1, y1, x2,...",13,2,1,11,6,False,False,False,2.4,0.091574,False,0,0,10,False,8,3,1,9.0
551,"def get_flow(self, x):\n b, n, c, h, w ...",10,2,1,5,7,False,False,False,3.5,0.008000,False,0,0,8,True,5,31,1,8.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
899900,def contain_base64(inputs):\n base64_string...,3,1,1,1,2,False,False,False,0.5,0.001585,False,0,0,14,True,0,23,1,8.0
899933,"def __conversation_history(self, history:list,...",13,3,1,4,5,False,False,False,1.8,0.009000,False,0,0,22,True,0,8,3,6.0
899973,def Singleton(cls):\n _instance = {}\n\n ...,9,1,2,2,1,False,False,False,0.5,0.001585,False,0,0,9,False,0,14,1,9.0
899974,"def _singleton(*args, **kargs):\n if cl...",4,0,1,1,1,False,False,False,0.5,0.001585,False,0,0,10,True,0,14,2,1.0


In [9]:
df = df.drop_duplicates()

In [10]:
# Keep only rows that have a quality_score (drops any nulls)
df = df[df['quality_score'].notnull()]

---

## Step 5: Bin Scores into Quality Labels

Once we have `quality_score`, we classify it into two discrete quality levels:
- **score ≤ 5** → `bad`
- **score > 5** → `good`

These categories are stored in a new column: `quality`.

This prepares our dataset for classification tasks, where the model will learn to predict the label based on features.
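
The binning cell in this notebook calls `pd.cut` with a single cut point at 5. A minimal sketch on toy scores (the values are made up for illustration, not taken from the dataset) shows how the boundary is handled: with the default `right=True`, a score of exactly 5 falls into the lower bin.

```python
import pandas as pd

scores = pd.Series([0.0, 5.0, 5.5, 10.0])  # toy scores, not from the dataset

labels = pd.cut(
    scores,
    bins=[-float("inf"), 5, float("inf")],
    labels=["bad", "good"],
)  # right=True by default, so the intervals are (-inf, 5] and (5, inf)

print(labels.tolist())  # ['bad', 'bad', 'good', 'good']
```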

---

In [11]:
df['quality'] = pd.cut(
	df['quality_score'],
	bins=[-float('inf'), 5, float('inf')],
	labels=['bad', 'good']
)


In [12]:
df = df[df['quality_score'].notna()]
if df.empty:
	raise ValueError("All quality scores are NaN. Check flake8 execution or input data.")

In [13]:
df.head()

Unnamed: 0,code_snippet,loc,num_args,num_returns,num_variables,num_function_calls,has_decorators,uses_globals,is_recursive,estimated_difficulty,...,has_docstring,docstring_length,num_comments,name_length,is_name_well_formed,bad_variable_names_count,max_return_length,estimated_complexity,quality_score,quality
0,"def download_pdf(link, location, name):\n t...",12,3,0,1,8,False,False,False,0.0,...,False,0,0,12,True,1,0,3,0.0,bad
1,def clean_pdf_link(link):\n if 'arxiv' in l...,8,1,1,2,4,False,False,False,1.0,...,False,0,0,14,True,0,4,3,7.0,good
2,"def clean_text(text, replacements = {':': '_',...",4,2,1,1,2,False,False,False,0.0,...,False,0,0,10,True,0,4,2,7.0,good
3,"def print_title(title, pattern = ""-""):\n pr...",2,2,0,0,3,False,False,False,0.5,...,False,0,0,11,True,0,0,1,7.0,good
4,def get_extension(link):\n extension = os.p...,7,1,3,1,1,False,False,False,0.666667,...,False,0,0,13,True,0,9,3,7.0,good


In [14]:
df.dropna(inplace=True)

Before dropping duplicate rows, we need to normalize the dtype of each column so that `drop_duplicates()` catches all actual duplicates. For example, the strings '2' and '2.0' represent the same value, but `drop_duplicates()` compares them as different strings and keeps both. Converting such columns to a numeric dtype first makes them compare as equal.
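
As a small illustration of that point, the following sketch uses a toy one-column frame (the column name `value` is an assumption for the example) and counts duplicates before and after numeric conversion.

```python
import pandas as pd

toy = pd.DataFrame({"value": ["2", "2.0"]})  # same number, stored as two strings

dupes_before = int(toy.duplicated().sum())   # strings differ, nothing flagged

toy["value"] = pd.to_numeric(toy["value"])   # both become the float 2.0
dupes_after = int(toy.duplicated().sum())    # second row is now a duplicate

print(dupes_before, dupes_after)  # 0 1
```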

In [15]:
df.count()

code_snippet                809945
loc                         809945
num_args                    809945
num_returns                 809945
num_variables               809945
num_function_calls          809945
has_decorators              809945
uses_globals                809945
is_recursive                809945
estimated_difficulty        809945
estimated_bugs              809945
has_docstring               809945
docstring_length            809945
num_comments                809945
name_length                 809945
is_name_well_formed         809945
bad_variable_names_count    809945
max_return_length           809945
estimated_complexity        809945
quality_score               809945
quality                     809945
dtype: int64

In [16]:
def coerce_types(df):
	for col in df.columns:
		if df[col].dtype == 'object':
			# First try numeric conversion
			try:
				df[col] = pd.to_numeric(df[col])
				continue
			except (ValueError, TypeError):
				pass
			
			# Handle mixed bools (strings + actual booleans)
			unique_vals = set(df[col].dropna().unique())
			bool_candidates = {'TRUE', 'FALSE', 'True', 'False', 'true', 'false', True, False}
			
			if unique_vals.issubset(bool_candidates):
				df[col] = df[col].apply(
					lambda x: x
						if isinstance(x, bool) 
						else str(x).lower() == 'true'
				)
	return df

df = coerce_types(df)

In [17]:
df.dtypes

code_snippet                  object
loc                            int64
num_args                       int64
num_returns                    int64
num_variables                  int64
num_function_calls             int64
has_decorators                  bool
uses_globals                    bool
is_recursive                    bool
estimated_difficulty         float64
estimated_bugs               float64
has_docstring                   bool
docstring_length               int64
num_comments                   int64
name_length                    int64
is_name_well_formed             bool
bad_variable_names_count       int64
max_return_length              int64
estimated_complexity           int64
quality_score                float64
quality                     category
dtype: object

In [18]:
df.drop_duplicates(inplace=True)

In [19]:
df.count()

code_snippet                809945
loc                         809945
num_args                    809945
num_returns                 809945
num_variables               809945
num_function_calls          809945
has_decorators              809945
uses_globals                809945
is_recursive                809945
estimated_difficulty        809945
estimated_bugs              809945
has_docstring               809945
docstring_length            809945
num_comments                809945
name_length                 809945
is_name_well_formed         809945
bad_variable_names_count    809945
max_return_length           809945
estimated_complexity        809945
quality_score               809945
quality                     809945
dtype: int64

## Step 6: Train-Test Split and Dataset Saving

Once the features are ready, we’ll split the dataset into **training** and **testing** sets.

### Why we do this:
- To **evaluate** model performance fairly.
- To **prevent data leakage** — the model should never "see" the test data during training.
- To **reuse** the same splits for all models and experiments.

### What we’ll do:
- Split 80% for training, 20% for testing using `train_test_split()`.
- Use `stratify=y` to preserve class proportions across splits.
- Save the resulting datasets (`X_train`, `X_test`, `y_train`, `y_test`) to the `data/processed/` folder so they can be easily loaded later in the training and evaluation notebooks.
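
A minimal sketch of what `stratify=y` buys us, using a toy frame with an 80/20 class imbalance (the data and the `feature` column are made up for illustration). It compares the share of `good` labels in each split and checks that no row index lands in both.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 80 "good" rows and 20 "bad" rows.
toy = pd.DataFrame({
    "feature": range(100),
    "quality": ["good"] * 80 + ["bad"] * 20,
})
X = toy[["feature"]]
y = toy["quality"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)

# Class proportions should be close to identical in both splits.
train_share = (y_train == "good").mean()
test_share = (y_test == "good").mean()

# No row should appear in both the training and the test set.
overlap = set(X_train.index) & set(X_test.index)

print(len(X_train), len(X_test), train_share, test_share, len(overlap))
```

Without `stratify`, a random split of an imbalanced dataset can leave the minority class under-represented in the test set, which makes evaluation metrics less reliable.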

---


In [20]:
# Sanity check: df must have no duplicate rows before splitting
assert not df.duplicated().any()

**Make sure `pyarrow` is installed before running the Parquet export cell (`to_parquet`):**
```
pip install pyarrow
```
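
One way to fail early, before the split is computed, is to check that the package is importable. This is a sketch using only the standard library; the helper names `has_module` and `require_pyarrow` are hypothetical and not part of this notebook.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if `name` can be imported in the current environment."""
    return importlib.util.find_spec(name) is not None


def require_pyarrow() -> None:
    """Raise a clear error if pyarrow is missing."""
    if not has_module("pyarrow"):
        raise ImportError("pyarrow is required: run `pip install pyarrow`.")
```

Calling `require_pyarrow()` at the top of the export cell would surface the problem before any files are written.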

In [21]:
# Separating the features and target 
X = df.drop(columns=['quality_score', 'quality'])
y = df['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Check that neither split contains duplicate feature rows
assert not X_train.duplicated().any(), "Duplicates found in X_train"
assert not X_test.duplicated().any(), "Duplicates found in X_test"

In [22]:
X_train.duplicated().sum()

np.int64(0)

In [24]:
X_train.to_parquet("../data/processed/X_train.parquet", index=False)
X_test.to_parquet("../data/processed/X_test.parquet", index=False)
y_train.to_frame().to_parquet("../data/processed/y_train.parquet", index=False)
y_test.to_frame().to_parquet("../data/processed/y_test.parquet", index=False)