## Loading the Dataset
Verify that the dataset is loaded and inspect the initial structure.

In [2]:
import pandas as pd

# Load the dirty dataset
df = pd.read_csv('student_performance_dirty.csv')
print("Dataset loaded successfully!")
df.head()

Dataset loaded successfully!


,Hours Studied,Previous Scores,Extracurricular Activities,Sleep Hours,Sample Question Papers Practiced,Performance Index,Notes
0,7.0,99,Yes,9.0,1,91.0,excellent
1,4.0,82,No,4.0,2,65.0,review
2,8.0,51,Yes,7.0,2,45.0,review
3,5.0,52,Yes,5.0,2,36.0,excellent
4,7.0,75,No,8.0,5,66.0,review


## Exploratory Data Analysis (EDA)
Explore the dataset to understand its structure and detect issues.

- Understand the dimensions and types of the data.
- Identify missing values and duplicates.
- Spot columns that may need conversion or removal.

In [5]:
# Check the overall shape and column names
print("Dataset shape is:", df.shape, "\n")
print("Columns are:", df.columns.tolist(), "\n")

# Optional: uncomment for a basic statistical summary (numeric columns only)
# print(df.describe())

# Check data types for each column
print("Data types:\n", df.dtypes, "\n")

# Identify missing values in each column
print("Missing values:\n", df.isnull().sum(), "\n")

# Identify duplicate rows
duplicates = df.duplicated().sum()
print("Number of duplicate rows:", duplicates)


Dataset shape is: (10005, 7) 

Columns are: ['Hours Studied', 'Previous Scores', 'Extracurricular Activities', 'Sleep Hours', 'Sample Question Papers Practiced', 'Performance Index', 'Notes'] 

Data types:
 Hours Studied                       float64
Previous Scores                       int64
Extracurricular Activities           object
Sleep Hours                         float64
Sample Question Papers Practiced      int64
Performance Index                   float64
Notes                                object
dtype: object 

Missing values:
 Hours Studied                       1032
Previous Scores                        0
Extracurricular Activities             0
Sleep Hours                          496
Sample Question Papers Practiced       0
Performance Index                      0
Notes                                  0
dtype: int64 

Number of duplicate rows: 44
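
Raw missing-value counts are easier to judge as a share of the rows. From the output above, 1032 of 10005 rows (about 10.3%) lack "Hours Studied" and 496 (about 5.0%) lack "Sleep Hours". The sketch below shows one way to compute such percentages; the `toy` frame and its values are made up for illustration and stand in for `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Hours Studied": [7.0, np.nan, 8.0, 5.0],
    "Sleep Hours": [9.0, 4.0, np.nan, np.nan],
    "Previous Scores": [99, 82, 51, 52],
})

# isnull() yields booleans; their mean is the fraction of missing entries per column
missing_pct = (toy.isnull().mean() * 100).round(1)
print(missing_pct)
```

Applied to the real data, the same expression would be `df.isnull().mean() * 100`.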


## Data Cleaning and Preprocessing

### Handling missing values
Impute missing values in "Hours Studied" and "Sleep Hours" with each column's mean.
- Mean imputation keeps every row, at the cost of slightly shrinking each column's variance.

In [7]:
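# Impute missing values for 'Hours Studied' and 'Sleep Hours' with the column mean.
# Assign the result back: calling fillna(inplace=True) on a selected column is
# chained assignment, which recent pandas versions warn about and may not apply.
df['Hours Studied'] = df['Hours Studied'].fillna(df['Hours Studied'].mean())
df['Sleep Hours'] = df['Sleep Hours'].fillna(df['Sleep Hours'].mean())

# Alternative: drop rows with missing values instead of imputing
# df.dropna(inplace=True)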
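
To see what mean imputation does to a column, the following sketch fills a single gap in a small, made-up series and compares the result with median imputation, a common alternative when a column is skewed. The `toy` frame and variable names are hypothetical; the notebook itself uses the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing entry (values are made up)
toy = pd.DataFrame({"Hours Studied": [2.0, np.nan, 4.0, 9.0]})

# mean() and median() skip NaN by default
col_mean = toy["Hours Studied"].mean()      # (2 + 4 + 9) / 3
col_median = toy["Hours Studied"].median()  # middle value of 2, 4, 9

# fillna returns a new Series; the original column is left unchanged
mean_filled = toy["Hours Studied"].fillna(col_mean)
median_filled = toy["Hours Studied"].fillna(col_median)

print(mean_filled.tolist())
print(median_filled.tolist())
```

The mean is pulled toward the large value 9.0, while the median is not, which is why the choice can matter for skewed columns.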

### Converting Data Types
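Make sure "Previous Scores" is stored as a numeric type. The EDA output above already reports it as int64, so here the conversion acts as a safeguard rather than a fix.
- With `errors='coerce'`, any value that cannot be parsed becomes NaN, so re-check missing values after converting.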

In [8]:
# Convert 'Previous Scores' to numeric
df['Previous Scores'] = pd.to_numeric(df['Previous Scores'], errors='coerce')
print("Converted 'Previous Scores' data type:", df['Previous Scores'].dtypes)

Converted 'Previous Scores' data type: int64
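
The effect of `errors='coerce'` is easiest to see on a column that really does contain strings. This is a minimal sketch on a made-up series (`raw` is hypothetical and not part of the dataset); it counts how many entries the conversion turned into NaN.

```python
import pandas as pd

# Hypothetical string column with one unparseable entry (values are made up)
raw = pd.Series(["99", "82", "n/a", "51"])

# Unparseable values become NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present before but are NaN after the conversion
n_failed = int(converted.isna().sum() - raw.isna().sum())

print(converted.dtype)
print("Values coerced to NaN:", n_failed)
```

Because NaN cannot be stored in an integer column, a coerced column with failures ends up as float64 rather than int64.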


### Removing Duplicates
Remove the 44 exact duplicate rows found during EDA so that repeated records are not over-weighted in training.
- `drop_duplicates` keeps the first occurrence of each repeated row by default.

In [9]:
# Remove duplicate rows
df.drop_duplicates(inplace=True)
print("Duplicates removed. New dataset shape:", df.shape)

Duplicates removed. New dataset shape: (9961, 7)
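
The shape drops from 10005 to 9961 rows, matching the 44 duplicates counted earlier. As a small sketch of the same step on made-up data, the block below also resets the index afterwards. That is an optional extra the notebook does not perform, so treat it as an assumption about what may be convenient later.

```python
import pandas as pd

# Hypothetical toy frame in which rows 2 and 3 repeat rows 0 and 1
toy = pd.DataFrame({
    "Hours Studied": [7.0, 4.0, 7.0, 4.0],
    "Performance Index": [91.0, 65.0, 91.0, 65.0],
})

# duplicated() marks every repeat after the first occurrence
n_dupes = int(toy.duplicated().sum())

# Drop the repeats, then renumber the rows so the index has no gaps
deduped = toy.drop_duplicates().reset_index(drop=True)

print("Duplicates found:", n_dupes)
print(deduped)
```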


### Dropping Irrelevant Columns
Remove columns that are not required for the regression task (e.g., "Notes").
- Eliminate features that do not contribute to the predictive task.

In [10]:
# Drop the irrelevant 'Notes' column
df.drop(columns=['Notes'], inplace=True)
print("Columns after dropping irrelevant data:", df.columns.tolist())

Columns after dropping irrelevant data: ['Hours Studied', 'Previous Scores', 'Extracurricular Activities', 'Sleep Hours', 'Sample Question Papers Practiced', 'Performance Index']


### Outlier Detection and Treatment
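Handle outliers in "Sample Question Papers Practiced" by capping values at the 95th percentile.
- Capping limits the influence of extreme values on the regression fit without dropping any rows.
- Because the cap is a float, the column's dtype changes from int64 to float64.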

In [12]:
import numpy as np

# Compute the 95th percentile of 'Sample Question Papers Practiced'
# and replace every value above it with that limit
upper_limit = df['Sample Question Papers Practiced'].quantile(0.95)
df['Sample Question Papers Practiced'] = np.where(
    df['Sample Question Papers Practiced'] > upper_limit,
    upper_limit,
    df['Sample Question Papers Practiced']
)
print("Outliers in 'Sample Question Papers Practiced' capped at:", upper_limit)

Outliers in 'Sample Question Papers Practiced' capped at: 9.0
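
The same capping can be written with `Series.clip`, which some readers find easier to scan than `np.where`. The sketch below uses a made-up series (`papers` is hypothetical) and checks that both forms agree; it illustrates an alternative spelling and is not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one extreme value (values are made up)
papers = pd.Series([1, 2, 2, 5, 9, 40])

# 95th percentile, using pandas' default linear interpolation
upper_limit = papers.quantile(0.95)

# clip(upper=...) replaces every value above the limit with the limit
capped = papers.clip(upper=upper_limit)

# Equivalent np.where form, as used in the cell above
capped_where = np.where(papers > upper_limit, upper_limit, papers)

print("Upper limit:", upper_limit)
print(capped.tolist())
```

Only the value 40 exceeds the limit here, so it is the only entry that changes.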


## Final Check and Readiness for Modeling
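Conduct a final review of the processed dataset before fitting a linear regression.
- Confirm that the cleaning steps above left no missing values and that the data types are as expected.
- Note that "Extracurricular Activities" is still an object column; it must be encoded numerically before it can be used as a regression feature.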

In [13]:
# Final check for missing values and data types
print("Missing values after preprocessing:\n", df.isnull().sum())
print("Data types after cleaning:\n", df.dtypes)

# Final preview of the cleaned dataset
print("Final dataset preview:")
df.head()

Missing values after preprocessing:
 Hours Studied                       0
Previous Scores                     0
Extracurricular Activities          0
Sleep Hours                         0
Sample Question Papers Practiced    0
Performance Index                   0
dtype: int64
Data types after cleaning:
 Hours Studied                       float64
Previous Scores                       int64
Extracurricular Activities           object
Sleep Hours                         float64
Sample Question Papers Practiced    float64
Performance Index                   float64
dtype: object
Final dataset preview:


,Hours Studied,Previous Scores,Extracurricular Activities,Sleep Hours,Sample Question Papers Practiced,Performance Index
0,7.0,99,Yes,9.0,1.0,91.0
1,4.0,82,No,4.0,2.0,65.0
2,8.0,51,Yes,7.0,2.0,45.0
3,5.0,52,Yes,5.0,2.0,36.0
4,7.0,75,No,8.0,5.0,66.0
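
The dtype listing above still shows "Extracurricular Activities" as object, and a linear regression needs numeric inputs. One possible way to handle this, sketched on made-up data, is to map the two labels to 1 and 0. This assumes the column contains only "Yes" and "No", as the previews suggest; any other label would become NaN under `map`. The notebook itself does not include this step.

```python
import pandas as pd

# Hypothetical toy column (values are made up); assumes only "Yes"/"No" occur
toy = pd.DataFrame({"Extracurricular Activities": ["Yes", "No", "Yes"]})

# Map each label to an integer; labels missing from the dict would become NaN
toy["Extracurricular Activities"] = toy["Extracurricular Activities"].map(
    {"Yes": 1, "No": 0}
)

# Guard against unexpected labels before modeling
n_unmapped = int(toy["Extracurricular Activities"].isna().sum())

print(toy["Extracurricular Activities"].tolist())
print("Unmapped labels:", n_unmapped)
```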


In [14]:
# Save the cleaned dataset; index=False omits the row index from the file
df.to_csv('student_performance_clean.csv', index=False)