Problem statement: <b>Data Wrangling I</b> 
* Import necessary Python libraries (e.g., pandas, numpy).
* Load the dataset into a pandas DataFrame.
* Check for missing values using `isnull()`, `info()`, and `describe()`.
* Describe variables, their types, and dataset dimensions.
* Format and normalize data (check and convert data types).
* Convert categorical variables to numerical (encoding).

In [1]:
import pandas as pd               # data manipulation and analysis
import numpy as np                # numerical operations
import matplotlib.pyplot as plt   # data visualization
import seaborn as sns             # statistical visualization built on matplotlib

### **pd.read_csv()**
**Reads a comma-separated values (CSV)** file into a DataFrame.

Parameters:
<table>
<thead>
<tr><th>Parameter</th><th>Valid Values</th><th>When to Use (Use Case)</th></tr>
</thead>
<tbody>
<tr><td><code>filepath_or_buffer</code></td><td>str, path object, or file-like object</td><td>Path or URL to your CSV file (first positional argument).</td></tr>
<tr><td><code>sep</code></td><td>str (e.g., ',', '\t', '|')</td><td>Use when the delimiter is not a comma (e.g., tab-delimited file).</td></tr>
<tr><td><code>header</code></td><td>int, list of int, or None (default: 'infer')</td><td>Use <code>header=None</code> when the file has no header row.</td></tr>
<tr><td><code>names</code></td><td>list of str</td><td>Manually define column names if there’s no header or to rename columns.</td></tr>
<tr><td><code>usecols</code></td><td>list of str/int</td><td>Read only specific columns to save memory.</td></tr>
<tr><td><code>nrows</code></td><td>int</td><td>Read only a few rows for quick inspection.</td></tr>
</tbody>
</table>
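
The parameters above can be combined in a single call. The sketch below is only an illustration: the pipe-delimited text and the column names are made up, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical pipe-delimited data with no header row (illustration only).
raw = "female|group B|72|72\nmale|group A|47|57\nmale|group C|76|78\n"

sample = pd.read_csv(
    io.StringIO(raw),                              # file-like object instead of a path
    sep="|",                                       # delimiter is not a comma
    header=None,                                   # the data has no header row
    names=["gender", "group", "math", "reading"],  # supply the column names
    usecols=["gender", "math"],                    # keep only two of the columns
    nrows=2,                                       # read only the first two rows
)
print(sample)
```

Reading a couple of rows with `nrows` first is a cheap way to confirm the delimiter and column names before loading a large file.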

### **df.head(n)**
Returns the first n rows of the DataFrame (default = 5).

In [2]:
df = pd.read_csv("1_StudentsPerformance.csv")

df.head()  # Display the first five rows

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,
1,female,group C,some college,standard,completed,69,90,88.0
2,female,group B,master's degree,standard,none,90,95,93.0
3,male,group A,associate's degree,free/reduced,none,47,57,44.0
4,male,group C,some college,standard,none,76,78,75.0


### **df.shape**
**Returns a tuple representing (rows, columns) of the DataFrame**.  
Output: Tuple → (int, int) like (1000, 8)  
Use Case: Quickly check dataset size for memory, processing time, or modeling.  
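
As a small illustration, `shape` can be unpacked directly into row and column counts. The frame below is a made-up three-row example, not the student dataset.

```python
import pandas as pd

# Hypothetical toy frame (illustration only).
demo = pd.DataFrame({"math score": [72, 69, 90], "reading score": [72, 90, 95]})

n_rows, n_cols = demo.shape   # shape is an attribute, so there are no parentheses
print(f"{n_rows} rows x {n_cols} columns")
```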

In [3]:
df.shape  # Get the shape of the dataset

(1000, 8)

### **df.describe()**
Generates **summary statistics of numerical columns** (or of all columns with `include='all'`).  

Output: DataFrame of statistics like count, mean, std, min, max, percentiles.

Parameters:
<table>
<thead>
<tr><th>Parameter</th><th>Valid Values</th><th>When to Use</th></tr>
</thead>
<tbody>
<tr><td><code>include</code></td><td>'all', 'object', 'number', list of dtypes</td><td>Use to describe non-numeric columns or include all columns.</td></tr>
<tr><td><code>exclude</code></td><td>list of dtypes</td><td>Use to omit specific data types from summary.</td></tr>
</tbody>
</table>
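
A minimal sketch of how `include` and `exclude` change the summary, using a made-up four-row frame rather than the student dataset:

```python
import pandas as pd

# Hypothetical toy frame with one text column and one numeric column.
demo = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "math score": [72, 69, 47, 76],
})

numeric_summary = demo.describe()                       # numeric columns only (default)
non_numeric_summary = demo.describe(exclude=["number"]) # count, unique, top, freq
full_summary = demo.describe(include="all")             # every column; NaN where a statistic does not apply

print(numeric_summary)
print(non_numeric_summary)
```

For non-numeric columns the summary reports `count`, `unique`, `top`, and `freq` instead of mean and percentiles.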

In [4]:
df.describe()  # Summary statistics of the numeric columns

Unnamed: 0,math score,reading score,writing score
count,1000.0,1000.0,999.0
mean,66.089,69.169,68.048048
std,15.16308,14.600192,15.202102
min,0.0,17.0,10.0
25%,57.0,59.0,57.5
50%,66.0,70.0,69.0
75%,77.0,79.0,79.0
max,100.0,100.0,100.0


### **df.info()**
Prints a concise summary of the DataFrame, including column data types, non-null counts, and memory usage.

Output: Summary (printed to console, not returned). 

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   gender                       1000 non-null   object 
 1   race/ethnicity               1000 non-null   object 
 2   parental level of education  1000 non-null   object 
 3   lunch                        1000 non-null   object 
 4   test preparation course      1000 non-null   object 
 5   math score                   1000 non-null   int64  
 6   reading score                1000 non-null   int64  
 7   writing score                999 non-null    float64
dtypes: float64(1), int64(2), object(5)
memory usage: 62.6+ KB


### **df.isnull()**
**Detects missing values (NaN) in a DataFrame or Series.**

Output: A DataFrame or Series of the same shape with boolean values: True for missing entries, False otherwise.

Use Case: Identify and handle missing data in datasets.

**sum():** Aggregates values along a specified axis. When used after isnull(), it counts the number of missing values.  
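
The sketch below shows why chaining `sum()` after `isnull()` yields a count: each `True` is treated as 1. The frame is a made-up example with a single missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing writing score.
demo = pd.DataFrame({
    "reading score": [72, 90, 95],
    "writing score": [np.nan, 88.0, 93.0],
})

mask = demo.isnull()           # boolean frame with the same shape as demo
per_column = mask.sum()        # missing values per column (axis=0 is the default)
per_row = mask.sum(axis=1)     # missing values per row
total = int(mask.sum().sum())  # missing values in the whole frame

print(per_column)
```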

In [6]:
df.isnull().sum()  # Count missing values per column

gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  1
dtype: int64

In [7]:
# Drop every row that contains at least one NaN value
df = df.dropna()

In [8]:
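# Change data type using astype(). Caveat: in pandas 2.x, df.loc[:, col] = ... writes
# into the existing float64 column, so the dtype can stay float64 (see the 88.0 values
# in the output of In [13]); df["writing score"] = ... replaces the column instead.
df.loc[:, "writing score"] = df["writing score"].astype(int)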
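
A minimal sketch of a float-to-int conversion on a made-up column, assuming the missing rows are dropped first (NumPy integers cannot represent NaN). Assigning the result back by column name replaces the column, so the new dtype is kept.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value.
demo = pd.DataFrame({"writing score": [np.nan, 88.0, 93.0]})

demo = demo.dropna()                                       # remove the NaN row first
demo["writing score"] = demo["writing score"].astype(int)  # replace the column with an int version

print(demo.dtypes)
```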

### **LabelEncoder()**
**Used to convert categorical labels into numerical form**. It assigns each unique category an integer value between 0 and n_classes-1.

Output: An array of integers representing the encoded labels.

Use Case: Transform categorical target variables (i.e., the dependent variable y) into a numerical format suitable for machine learning algorithms.
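
A short illustration on a made-up list of labels: the encoder sorts the unique categories and numbers them from 0, and `inverse_transform` recovers the original strings.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels (illustration only).
labels = ["none", "completed", "none", "completed", "none"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)        # integers in the range 0 .. n_classes-1
restored = encoder.inverse_transform(codes)  # back to the original strings

print(encoder.classes_)  # learned categories, in sorted order
print(codes)
```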

In [9]:
from sklearn.preprocessing import LabelEncoder

In [10]:
le = LabelEncoder()

In [11]:
cat_columns = ['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course']

In [12]:
for col in cat_columns:
    df[col] = le.fit_transform(df[col])
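
Refitting a single encoder in a loop keeps only the mapping of the last column. If the mappings are needed later (for example, to decode the values), one option is a separate encoder per column. The sketch below uses a made-up frame, and the `encoders` and `mappings` dictionaries are hypothetical names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame (illustration only).
demo = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "lunch": ["standard", "free/reduced", "standard"],
})

encoders = {}                                    # one fitted encoder per column
for col in ["gender", "lunch"]:
    encoders[col] = LabelEncoder()
    demo[col] = encoders[col].fit_transform(demo[col])

# Category -> integer lookup for each column, built from the fitted encoders.
mappings = {
    col: {str(category): code for code, category in enumerate(enc.classes_)}
    for col, enc in encoders.items()
}
print(mappings)
```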

In [13]:
df

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
1,0,2,4,1,0,69,90,88.0
2,0,1,3,1,1,90,95,93.0
3,1,0,0,0,1,47,57,44.0
4,1,2,4,1,1,76,78,75.0
5,0,1,0,1,1,71,83,78.0
...,...,...,...,...,...,...,...,...
995,0,4,3,1,0,88,99,95.0
996,1,2,2,0,1,62,55,55.0
997,0,2,2,0,0,59,71,65.0
998,0,3,4,1,0,68,78,77.0


### **MinMaxScaler()**
**Transforms features by scaling each feature to a given range, typically between zero and one.**

Output: A NumPy array of scaled features (by default), which can be assigned back to the DataFrame columns.

Use Case: Normalize features for machine learning algorithms sensitive to the scale of input data.
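
With the default range of 0 to 1, the scaling applied to each column is (x - min) / (max - min). The sketch below checks that formula against `MinMaxScaler` on three made-up values chosen to match the reading-score minimum (17) and maximum (100) reported by `describe()`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One feature column; values chosen for illustration.
scores = np.array([[17.0], [90.0], [100.0]])

scaled = MinMaxScaler().fit_transform(scores)  # default feature_range=(0, 1)

# The same computation written out by hand, column by column.
col_min = scores.min(axis=0)
col_max = scores.max(axis=0)
manual = (scores - col_min) / (col_max - col_min)

print(scaled.ravel())
```

Here a score of 90 maps to (90 - 17) / (100 - 17) = 73/83, which is approximately 0.8795.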

In [14]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

numeric_columns = ['math score', 'reading score', 'writing score']

df[numeric_columns] = scaler.fit_transform(df[numeric_columns])


In [15]:
df.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
1,0,2,4,1,0,0.69,0.879518,0.866667
2,0,1,3,1,1,0.9,0.939759,0.922222
3,1,0,0,0,1,0.47,0.481928,0.377778
4,1,2,4,1,1,0.76,0.73494,0.722222
5,0,1,0,1,1,0.71,0.795181,0.755556
