<a href="https://colab.research.google.com/github/SaraAljuraybah/Data-Mining-Project/blob/main/Reports/Phase2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Phase 2: Data Summarization and Pre-processing

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
sns.set(style="whitegrid")
df = pd.read_excel("Raw_dataset.xlsx")
print("Shape:", df.shape)
df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'Raw_dataset.xlsx'

# 1. Data Analysis
****
**1.1 Statistical Summary**

In [None]:
df.describe()


The statistical summary provides an overview of the dataset. For example, the Age attribute shows an average of around 18 years, which is reasonable since the dataset represents undergraduate students. However, the maximum value of 48 years indicates the presence of an outlier that should be considered during preprocessing.

Some attributes such as Gender and Name of College are shown as numerical values in the summary, but in reality, they are categorical features that have been encoded using numbers. Therefore, their mean and standard deviation are not meaningful and should be interpreted through frequency counts instead.
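As a minimal sketch of this point, the snippet below contrasts the mean of an encoded column with its frequency counts. The `toy` frame and its values are hypothetical and only mimic the 1 = Male, 2 = Female coding used in this dataset.

```python
import pandas as pd

# Hypothetical toy data: Gender encoded as 1 = Male, 2 = Female.
toy = pd.DataFrame({"Gender": [1, 2, 1, 1, 2, 1]})

# The mean of an encoded label has no real interpretation.
gender_mean = toy["Gender"].mean()

# Frequency counts and proportions describe the column properly.
gender_counts = toy["Gender"].value_counts()
gender_share = toy["Gender"].value_counts(normalize=True)

print("Mean of codes (not meaningful):", round(gender_mean, 2))
print(gender_counts)
print(gender_share.round(2))
```

The same two `value_counts()` calls could be applied to `df["Gender"]` or `df["Name of College"]` once the raw file is loaded.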

**1.2 Missing values**

In [None]:
df.isnull().sum()


The missing values analysis shows that there are no null or missing entries in the dataset. This is an advantage since it ensures the dataset is complete and reduces the need for imputation or dropping rows/columns. Having a complete dataset allows us to focus on other preprocessing tasks such as handling outliers, encoding categorical features, and normalization.


**1.3 Plots**
- Histogram (Age)

In [None]:
plt.figure(figsize=(8,5))
sns.histplot(df["Age"], bins=20, kde=True, color="skyblue")
plt.title("Distribution of Age", fontsize=14)
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()

The histogram above shows the distribution of students’ ages. Most students are between 17 and 20 years old, which is expected for undergraduate levels. The data is slightly right-skewed, meaning there are a few older participants aged above 25. This pattern confirms that the dataset mainly represents young students, while the presence of older ages (up to 48) might indicate outliers that should be checked during the preprocessing phase.



- Boxplot (Age)

In [None]:
plt.figure(figsize=(6,4))
sns.boxplot(x=df["Age"], color="lightcoral")
plt.title("Boxplot of Age", fontsize=14)
plt.xlabel("Age")
plt.show()


The boxplot of the Age attribute shows that the majority of students are between 17 and 20 years old, with a median age of 18. Several outliers are present, including values above 25 and one extreme case at 48 years, which may indicate data entry errors or older students. These outliers should be carefully considered during preprocessing.

- Barplot (Gender)


In [None]:
plt.figure(figsize=(6,4))
sns.countplot(data=df, x="Gender", hue="Gender", palette="Set2", legend=False)
plt.title("Distribution of Gender", fontsize=14)
plt.xlabel("Gender (1 = Male, 2 = Female)")
plt.ylabel("Count")
plt.show()


The bar plot of the Gender attribute shows the distribution of male and female students. The dataset indicates that there are more male students compared to female students. However, the appearance of an unexpected value (4) suggests the presence of data entry or coding errors that need to be addressed during preprocessing.

- Barplot (Use Instagram)

In [None]:
plt.figure(figsize=(6,4))
sns.countplot(data=df, x="Use Instagram", hue="Use Instagram", palette="pastel", legend=False)
plt.title("Distribution of Instagram Usage", fontsize=14)
plt.xlabel("Use Instagram (0 = No, 1 = Yes)")
plt.ylabel("Count")
plt.show()

The plot shows Instagram usage distribution: most students use Instagram (1), while fewer do not (0). One invalid entry (11) was detected, which should be corrected during preprocessing.

In [None]:
print("Unique values in Use Instagram:", df["Use Instagram"].unique())

invalid_count = df[df["Use Instagram"] == 11].shape[0]
print("Number of invalid entries (11):", invalid_count)

**1.4 Class label distribution**

In [None]:
plt.figure(figsize=(7,4))
sns.countplot(data=df, x="Impact on Academic Performance", hue="Impact on Academic Performance", palette="viridis", legend=False)
plt.title("Distribution of Impact on Academic Performance", fontsize=14)
plt.xlabel("Impact on Academic Performance (1 = Very Low → 5 = Very High, 0 = Invalid)", fontsize=10)
plt.ylabel("Count")
plt.show()

print(df["Impact on Academic Performance"].value_counts())


The plot shows the distribution of the class label Impact on Academic Performance. Most students fall between levels 2 and 5, with the highest counts at level 3 (747) and level 5 (713). Fewer students are at level 1 (561) and level 2 (462), and 22 invalid entries (0) were detected. These invalid values will need to be handled during preprocessing.

# 2. Data Pre-processing
****

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df_raw = pd.read_excel("Raw_dataset.xlsx")
df=df_raw.copy()
print("shape:",df.shape)
df.head()

**2.1 Duplicates Removal (Cleaning data)**

In [None]:
# first we check for duplicates
dup_count = df.duplicated().sum()
print("Number of duplicates: ",dup_count)

print("Shape before removing duplicates: ", df.shape)
#removing the duplicates
df = df.drop_duplicates()
print("Shape after removing duplicates: ", df.shape)

- We detected 166 duplicate rows and removed them using drop_duplicates(). This step prevents redundancy and ensures each record is unique. Result: the dataset was reduced from 3028 to 2862 rows.

- Next, we checked the unique values of the categorical columns.
- Detecting invalid values is important for correcting data entry errors and improving reliability.

- Result: invalid values were marked for cleaning in a later step.



**Variable transformation**

Variable transformation is the process of modifying data values to improve consistency and prepare them for analysis. It helps ensure that all features are on comparable scales and that patterns become clearer.

In this project, discretization, a square root transformation, and normalization were applied to selected columns. Normalization, for example, standardizes ranges (e.g., converting a 0–5 scale into 0–1), which makes the data more suitable for interpretation and modeling.


**Discretization**

In [None]:
discret= 'Age'
numOfBins=10
df['discretized'] =pd.cut(df[discret], bins=numOfBins, labels=False)
print(df[['Age', 'discretized']])


In this step, the 'Age' column was discretized using the pd.cut() function, which divides continuous numeric values into fixed, equal-width intervals (bins).

This transformation groups similar age values together, simplifying the data and making patterns easier to identify.


The resulting “discretized” column assigns each age to a specific category (e.g., 0, 1, 2), which helps in comparing and analyzing the data more effectively during later stages of modeling.
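To make the binning concrete, here is a small sketch on hypothetical ages (not taken from the dataset). It uses 4 bins instead of 10 to keep the output short, and asks pd.cut() to return the bin edges as well via `retbins=True`.

```python
import pandas as pd

# Hypothetical ages, including one extreme value.
ages = pd.Series([17, 18, 18, 19, 20, 22, 25, 48])

# bins=4 splits the range [min, max] into 4 equal-width intervals;
# labels=False returns the integer index of each interval.
codes, edges = pd.cut(ages, bins=4, labels=False, retbins=True)

print(codes.tolist())
print(edges.round(2))
```

Because the bins are equal-width, a single extreme age stretches the range, so most of the typical ages end up in the first bin. This is one reason it may be worth handling the Age outliers before discretizing.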

**Square root transformation**

In [None]:
import numpy as np
import pandas as pd

df['Time_spent_sqrt']=np.sqrt(df['Time Spent'])
print(df[['Time Spent', 'Time_spent_sqrt']])

A square root transformation was applied to the 'Time Spent' column, which represents a 0–5 scale.
This transformation reduces the effect of higher scale values and smooths out differences between responses, making the distribution more balanced while keeping the overall meaning of the data the same.
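The compression effect can be checked on the scale itself. The sketch below assumes the scale takes the integer values 0 to 5, as described above, and prints the gaps between neighbouring values after the transformation.

```python
import numpy as np

# The 0-5 scale described above (integer levels are an assumption).
scale = np.arange(6)          # 0, 1, 2, 3, 4, 5
root = np.sqrt(scale)

# Gaps between neighbouring values shrink as the scale grows,
# so high responses are pulled closer together than low ones.
gaps = np.diff(root)

print(root.round(3))
print(gaps.round(3))
```

The ordering of the responses is unchanged; only the spacing between the higher levels is reduced.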

**Z-score normalization**

In [None]:
from sklearn.preprocessing import StandardScaler
scaler= StandardScaler()

df['Frequency of Use_zscore']=scaler.fit_transform(df[['Frequency of Use']])
print(df[['Frequency of Use', 'Frequency of Use_zscore']])

Z-score normalization was applied to the 'Frequency of Use' column to standardize its values. This transformation rescales the data to a mean of 0 and a standard deviation of 1, which makes the column easier to compare with other variables and prevents scale differences from affecting the analysis.

After the transformation, each value can be read as follows:

- 0 → the average frequency of use.
- positive → above average.
- negative → below average.
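The interpretation above follows from the formula z = (x - mean) / std. The sketch below applies it by hand with NumPy to hypothetical responses; it uses the population standard deviation (`ddof=0`), which is the convention StandardScaler follows.

```python
import numpy as np

# Hypothetical "Frequency of Use" responses.
freq = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# z = (x - mean) / std, with the population standard deviation (ddof=0).
z = (freq - freq.mean()) / freq.std(ddof=0)

print(z.round(3))
print("mean:", round(float(z.mean()), 6), "std:", round(float(z.std(ddof=0)), 6))
```

The value equal to the mean (3.0 here) maps to 0, smaller values become negative, and larger values become positive.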

**Min-Max normalization**




In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['Impact on Academic Performance_scaled']] = scaler.fit_transform(df[['Impact on Academic Performance']])
print(df[['Impact on Academic Performance', 'Impact on Academic Performance_scaled']])

The 'Impact on Academic Performance' column was normalized using the Min–Max scaling technique, which transforms all values into a range between 0 and 1.

This method ensures consistency across numerical features, allowing fair comparison and improving the performance of upcoming analysis and modeling steps.
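Min–Max scaling computes (x - min) / (max - min) for each value. The sketch below is a hand-written version with a hypothetical helper `min_max`, shown on two illustrative inputs: the valid levels 1 to 5 only, and the same levels with an invalid 0 still present.

```python
import numpy as np

def min_max(values):
    """Rescale values to [0, 1] with (x - min) / (max - min)."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

valid_only = min_max([1, 2, 3, 4, 5])        # only valid levels
with_invalid = min_max([0, 1, 2, 3, 4, 5])   # invalid 0 still present

print(valid_only)
print(with_invalid)
```

If the invalid 0 entries are still in the column when the scaler is fitted, 0 becomes the minimum and level 1 no longer maps to 0. Under that assumption, it may be safer to scale after the invalid values have been handled.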

**2.2 Invalid Values Detection (Cleaning data)**

In [None]:
#invalid values
print("Unique values in Gender: ",df["Gender"].unique())
print("Unique values in Use Instagram: ",df["Use Instagram"].unique())
print("Unique values in Impact on Academic Performance: ",df["Impact on Academic Performance"].unique())

- Invalid entries in Gender, Use Instagram, and Impact on Academic Performance were replaced with NaN. This prepares the dataset for consistent imputation instead of treating wrong values as valid.

- Result: Columns now contain only valid categories plus NaN

**Handle Invalid Values**

In [None]:
import numpy as np

# invalid values handling

# Gender
df.loc[~df["Gender"].isin([1,2]) , "Gender" ] = np.nan

# Use Instagram
df.loc[~df["Use Instagram"].isin([0,1]) , "Use Instagram" ] = np.nan

# Impact on Academic Performance
df.loc[~df["Impact on Academic Performance"].isin([1,2,3,4,5]) , "Impact on Academic Performance" ] = np.nan


# printing the values for double-checking
print("Unique values in Gender: ",df["Gender"].unique())
print("Unique values in Use Instagram: ",df["Use Instagram"].unique())
print("Unique values in Impact on Academic Performance: ",df["Impact on Academic Performance"].unique())

Note: this step probably needs an explanation.

**Imputation -Binary- (mode/ median)**

In [None]:
# impute Gender and Use Instagram with mode
for col in ["Gender","Use Instagram"]:
    mode_val = df[col].mode()[0]
    df[col] = df[col].fillna(mode_val)
    print(col, "NaN filled with mode:", mode_val)

# impute Impact on Academic Performance with median (rounded)
median_val = round(df["Impact on Academic Performance"].median())
df["Impact on Academic Performance"] = df["Impact on Academic Performance"].fillna(median_val)
print("Impact on Academic Performance NaN filled with median:", median_val)

- We used the mode to impute missing values in the categorical columns (Gender and Use Instagram) because it is the most frequent value and keeps every imputed entry within the existing categories.

- For the target column (Impact on Academic Performance), we used the median since it is more suitable for ordinal data (Likert scale) and is less affected by outliers. We also applied the round() function to ensure the imputed value is an integer between 1 and 5.

**Outliers Handling -Numeric- (Age - IQR)**

In [None]:
# Compute the IQR fences for Age
Q1 = df["Age"].quantile(0.25)
Q3 = df["Age"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Keep only the rows whose Age lies within the fences
df_cleaned = df[(df["Age"] >= lower_bound) & (df["Age"] <= upper_bound)]
print("Shape Before removing :", df.shape)
print("Shape After removing :", df_cleaned.shape)

- We applied the IQR method to detect and remove outliers in the Age column. A total of 84 records were identified as outliers (ages above 23, including an extreme value of 48). These rows were removed to reduce noise and ensure that the dataset better represents the actual distribution of students. The dataset size decreased slightly, but this has minimal impact compared to the benefit of having cleaner data
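For a worked example of the IQR rule, the sketch below uses a handful of hypothetical ages (not the real data) and shows how the fences are derived and which values fall outside them.

```python
import pandas as pd

# Hypothetical ages with one extreme value.
ages = pd.Series([17, 18, 18, 19, 19, 20, 21, 48])

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask of values outside the IQR fences.
is_outlier = (ages < lower) | (ages > upper)

print("bounds:", lower, upper)
print("outliers:", ages[is_outlier].tolist())
```

Counting `is_outlier.sum()` on the real Age column is one way to confirm the number of removed records reported above.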

---- Sara's part starts here ----

**2.4 Manual feature selection**

In [None]:

desired_cols = [
    "Age",
    "Gender",
    "Year of Study",
    "Use Instagram",
    "Use Twitter",
    "Use Snapchat",
    "Use Tiktok",
    "Time Spent",
    "Academic Purpose",
    "Entertainment",
    "Social Interaction",
    "Addiction",
    "Difficulty in Concentrating on Studies",
    "Impact on Academic Performance",  # target
]


existing = [c for c in desired_cols if c in df.columns]
missing  = [c for c in desired_cols if c not in df.columns]

print("Shape before manual selection:", df.shape)
print("Missing columns (check spelling / caps):", missing)

df_model = df[existing].copy()

print("Shape after manual selection :", df_model.shape)
print("Final kept columns:", list(df_model.columns))


In [None]:
df_model.head()

**2.5 Encode categorical features**

We converted categorical variables into numeric dummy columns to be usable by ML models.
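The notebook does not show which columns were encoded, so the following is only a sketch of the idea using pd.get_dummies() on a hypothetical frame. It assumes Gender is the categorical column and leaves the numeric Age column unchanged.

```python
import pandas as pd

# Hypothetical frame; Gender follows the 1 = Male, 2 = Female coding.
toy = pd.DataFrame({"Gender": [1, 2, 2, 1], "Age": [18, 19, 20, 18]})

# One-hot encode Gender; the numeric Age column is left untouched.
encoded = pd.get_dummies(toy, columns=["Gender"], prefix="Gender", dtype=int)

print(encoded)
```

Each category becomes its own 0/1 column (`Gender_1`, `Gender_2`), so a model does not treat the codes as an ordered quantity.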


**2.8 Standardization (for clustering & models sensitive to scale)**

**2.9 Save final datasets**