# JawwadAhmad_Assignment


**Purpose:** Example Jupyter Notebook demonstrating data exploration and preprocessing using Pandas.

**Dataset:** A small sample dataset named `students` (16 rows) with columns: `Name, Age, Gender, Marks, City, EnrollmentDate`.


In [1]:
# Create the sample DataFrame
import pandas as pd

students = {
    "Name": [
        "Ali", "Sara", "Ahmed", "Hina", "Omar", "Zara", "Bilal", "Aisha",
        "Noor", "Sameer", "Ali", "Fatima", "Yusuf", "Sana",
        "Sara", "Ahmed"  # 👈 intentional duplicates of rows 2 and 3 (index 1 and 2)
    ],
    "Age": [
        20, 22, 21, 23, 20, 21, 22, 24,
        22, 23, 20, None, 21, 22,
        22, 21
    ],
    "Gender": [
        "Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female",
        "Female", "Male", "Male", "Female", "Male", "Female",
        "Female", "Male"
    ],
    "Marks": [
        85, 90, 78, 92, 85, None, 88, 95,
        76, 85, 85, 82, 90, 88,
        90, 78
    ],
    "City": [
        "Lahore", "Karachi", "Lahore", "Islamabad", "Karachi", "Lahore", "Multan", "Peshawar",
        "Lahore", "Karachi", "Lahore", "Islamabad", "Multan", "Karachi",
        "Karachi", "Lahore"
    ],
    # Added mixed and inconsistent date formats, including missing ones
    "EnrollmentDate": [
        "2024-01-15",   # YYYY-MM-DD
        "15/02/2024",   # DD/MM/YYYY
        "2024-03-01",   # YYYY-MM-DD
        "March 5, 2024",  # Month name format
        "2024-01-20",   # YYYY-MM-DD
        "",             # empty string (missing)
        "2024-02-28",   # YYYY-MM-DD
        "2024-03-10",   # YYYY-MM-DD
        "2024/03/12",   # YYYY/MM/DD
        "05-02-2024",   # DD-MM-YYYY
        "2024-01-15",   # duplicate of first (Ali)
        "2024-03-01",   # duplicate of Ahmed's date
        None,           # missing
        "28/02/2024",   # DD/MM/YYYY
        "15/02/2024",   # duplicate of Sara's date
        "2024-03-01"    # duplicate of Ahmed's date
    ]
}

jawwad_ahmad_df = pd.DataFrame(students)
jawwad_ahmad_df

Unnamed: 0,Name,Age,Gender,Marks,City,EnrollmentDate
0,Ali,20.0,Male,85.0,Lahore,2024-01-15
1,Sara,22.0,Female,90.0,Karachi,15/02/2024
2,Ahmed,21.0,Male,78.0,Lahore,2024-03-01
3,Hina,23.0,Female,92.0,Islamabad,"March 5, 2024"
4,Omar,20.0,Male,85.0,Karachi,2024-01-20
5,Zara,21.0,Female,,Lahore,
6,Bilal,22.0,Male,88.0,Multan,2024-02-28
7,Aisha,24.0,Female,95.0,Peshawar,2024-03-10
8,Noor,22.0,Female,76.0,Lahore,2024/03/12
9,Sameer,23.0,Male,85.0,Karachi,05-02-2024


## 1. `head()`: show first rows
**Full name:** Head method

**Purpose:** To view the first few rows of the dataset.

**Default:** Shows first 5 rows, but you can specify a number.

`df.head()`

In [2]:
jawwad_ahmad_df.head()

Unnamed: 0,Name,Age,Gender,Marks,City,EnrollmentDate
0,Ali,20.0,Male,85.0,Lahore,2024-01-15
1,Sara,22.0,Female,90.0,Karachi,15/02/2024
2,Ahmed,21.0,Male,78.0,Lahore,2024-03-01
3,Hina,23.0,Female,92.0,Islamabad,"March 5, 2024"
4,Omar,20.0,Male,85.0,Karachi,2024-01-20
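
As a small illustration of passing a number to `head()`, the sketch below uses a toy DataFrame (hypothetical data, not the notebook's dataset). A negative number is also accepted and returns every row except the last `n`.

```python
import pandas as pd

# Toy frame standing in for the notebook's DataFrame (hypothetical data)
df = pd.DataFrame({
    "Name": ["Ali", "Sara", "Ahmed", "Hina", "Omar", "Zara"],
    "Marks": [85, 90, 78, 92, 85, 88],
})

first_three = df.head(3)        # first 3 rows instead of the default 5
all_but_last_two = df.head(-2)  # negative n: every row except the last 2

print(first_three)
print(all_but_last_two)
```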


## 2. `tail()`: show last rows

**Full name:** Tail method

**Purpose:** To view the last few rows of the dataset.

`df.tail()`

In [3]:
jawwad_ahmad_df.tail()

Unnamed: 0,Name,Age,Gender,Marks,City,EnrollmentDate
11,Fatima,,Female,82.0,Islamabad,2024-03-01
12,Yusuf,21.0,Male,90.0,Multan,
13,Sana,22.0,Female,88.0,Karachi,28/02/2024
14,Sara,22.0,Female,90.0,Karachi,15/02/2024
15,Ahmed,21.0,Male,78.0,Lahore,2024-03-01


## 3. `info()`: information about columns, non-null counts, and dtypes

**Full name:** Information summary

**Purpose:** Shows column names, non-null counts, and data types. Also helps detect missing values.

`df.info()`

In [4]:
jawwad_ahmad_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Name            16 non-null     object 
 1   Age             15 non-null     float64
 2   Gender          16 non-null     object 
 3   Marks           15 non-null     float64
 4   City            16 non-null     object 
 5   EnrollmentDate  15 non-null     object 
dtypes: float64(2), object(4)
memory usage: 900.0+ bytes


## 4. `shape`: number of rows and columns

**Full name:** Shape attribute

**Purpose:** Returns the number of rows and columns in the dataset.

`df.shape`

In [5]:
jawwad_ahmad_df.shape

(16, 6)

## 5. `describe()`: descriptive statistics for numeric columns

**Full name:** Descriptive statistics summary


**Purpose:** Provides summary statistics for all numeric columns.

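**Includes:** count, mean, standard deviation, minimum, maximum, and quartiles (25%, 50%, 75%). Missing values are skipped, which is why `count` is 15 rather than 16 for both columns.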

`df.describe()`

In [6]:
jawwad_ahmad_df.describe()

Unnamed: 0,Age,Marks
count,15.0,15.0
mean,21.6,85.8
std,1.183216,5.479833
min,20.0,76.0
25%,21.0,83.5
50%,22.0,85.0
75%,22.0,90.0
max,24.0,95.0
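
To see how `describe()` treats missing values, here is a minimal sketch on a hypothetical four-value Series. It only illustrates that `count` and `mean` are computed over the non-missing entries.

```python
import pandas as pd

# Hypothetical marks with one missing value
marks = pd.Series([85, 90, None, 78], name="Marks")

summary = marks.describe()

# count ignores the missing entry; mean is (85 + 90 + 78) / 3
print(summary["count"])
print(summary["mean"])
```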


## 6. `isnull().sum()`: count of missing values per column

**Full name:** Check for missing (null) values

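**Purpose:** Shows how many missing (`NaN` or `None`) values each column has. Empty strings are not counted as missing, so the blank `EnrollmentDate` in row 5 is not included in the count.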

`df.isnull().sum()`

In [7]:
jawwad_ahmad_df.isnull().sum()

Name              0
Age               1
Gender            0
Marks             1
City              0
EnrollmentDate    1
dtype: int64
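
Because `isnull()` does not treat empty strings as missing, blank text cells can hide from this count. The sketch below shows one possible way to convert blank or whitespace-only strings to missing values first; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical date strings: one valid, one empty, one None, one whitespace-only
dates = pd.Series(["2024-01-15", "", None, "   "], name="EnrollmentDate")

# Only the None entry is reported as missing
print(dates.isnull().sum())

# Mark blank or whitespace-only strings, then replace them with NaN
blank = dates.str.strip() == ""
cleaned = dates.mask(blank)

# Now the empty and whitespace-only entries count as missing too
print(cleaned.isnull().sum())
```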

## 7. `corr()`: correlation matrix for numeric columns

**Full name:** Correlation matrix

**Purpose:** Shows how strongly numerical columns are linearly related to each other (Pearson correlation by default).

**Values range** from -1 to +1.

**+1 →** perfect positive linear relationship

**-1 →** perfect negative linear relationship

**0 →** no linear relationship


`df.corr(numeric_only=True)`

In [8]:
jawwad_ahmad_df.corr(numeric_only=True)


Unnamed: 0,Age,Marks
Age,1.0,0.468866
Marks,0.468866,1.0
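
To make the meaning of the coefficient more concrete, the following sketch computes the Pearson correlation by hand and compares it with `corr()`. The `hours` and `score` columns are made-up illustration data, not part of the notebook's dataset.

```python
import pandas as pd

# Hypothetical data: study hours vs. score, plus a text column that corr() should skip
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 55, 61, 64, 70],
    "label": list("abcde"),
})

# pandas result, restricted to numeric columns
r_pandas = df.corr(numeric_only=True).loc["hours", "score"]

# Manual Pearson correlation: covariance term divided by the product of spreads
dx = df["hours"] - df["hours"].mean()
dy = df["score"] - df["score"].mean()
r_manual = (dx * dy).sum() / ((dx ** 2).sum() * (dy ** 2).sum()) ** 0.5

print(r_pandas, r_manual)
```

Both values should agree up to floating-point rounding, and they are close to +1 because the scores rise steadily with hours.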


## 8. `columns`: list column names

**Full name:** Columns attribute

**Purpose:** Lists all column names in the dataset.

`df.columns`

In [9]:
jawwad_ahmad_df.columns

Index(['Name', 'Age', 'Gender', 'Marks', 'City', 'EnrollmentDate'], dtype='object')

## 9. `dtypes`: data types of each column

**Full name:** Data types attribute

**Purpose:** Displays the data type of each column.

**Meaning:** Shows whether columns contain numbers, text, dates, etc.

`df.dtypes`

In [10]:
jawwad_ahmad_df.dtypes

Name               object
Age               float64
Gender             object
Marks             float64
City               object
EnrollmentDate     object
dtype: object
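
Once the dtypes are known, `select_dtypes()` can pick columns by type. This is a minimal sketch on a hypothetical two-row frame, shown only to illustrate the idea.

```python
import pandas as pd

# Hypothetical frame with one text column and two numeric columns
df = pd.DataFrame({
    "Name": ["Ali", "Sara"],
    "Age": [20.0, 22.0],
    "Marks": [85.0, 90.0],
})

# Column names whose dtype is numeric
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric
other_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols, other_cols)
```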

## 10. `value_counts()`: count unique values in a column

**Full name:** Value counts method

**Purpose:** Counts how many times each unique value appears in a column.

**Meaning:** Tells how many times each city appears, which is useful for categorical data.

`df['City'].value_counts()`

In [11]:
jawwad_ahmad_df['City'].value_counts()

City
Lahore       6
Karachi      5
Islamabad    2
Multan       2
Peshawar     1
Name: count, dtype: int64
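
If proportions are more useful than raw counts, `value_counts(normalize=True)` returns each value's share of the total. The sketch below reuses the 16 city values from the sample dataset.

```python
import pandas as pd

# City values copied from the sample dataset above
cities = pd.Series(
    ["Lahore", "Karachi", "Lahore", "Islamabad", "Karachi", "Lahore", "Multan", "Peshawar",
     "Lahore", "Karachi", "Lahore", "Islamabad", "Multan", "Karachi", "Karachi", "Lahore"],
    name="City",
)

# Share of rows per city instead of raw counts (6 of 16 rows are Lahore)
shares = cities.value_counts(normalize=True)
print(shares)
```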

## 11. `groupby().mean()`: group by a column and compute mean

**Full name:** Group by method

**Purpose:** Groups the data based on a column and calculates statistics like mean, sum, etc.

**Meaning:** Finds the average marks for each city. Missing marks are skipped when the mean is computed.

`df.groupby('City')['Marks'].mean()`

In [12]:
jawwad_ahmad_df.groupby('City')['Marks'].mean()

City
Islamabad    87.0
Karachi      87.6
Lahore       80.4
Multan       89.0
Peshawar     95.0
Name: Marks, dtype: float64
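
Several statistics can be computed per group in one call with `agg()`. The following is a minimal sketch on hypothetical data, not the notebook's dataset.

```python
import pandas as pd

# Hypothetical marks for two cities, with one missing value
df = pd.DataFrame({
    "City": ["Lahore", "Lahore", "Karachi", "Karachi", "Karachi"],
    "Marks": [80.0, None, 90.0, 85.0, 95.0],
})

# One row per city, one column per statistic; missing marks are ignored
stats = df.groupby("City")["Marks"].agg(["mean", "count", "max"])
print(stats)
```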

## 12. `sort_values()`: sort rows by a column

**Full name:** Sort values method

**Purpose:** Sorts the dataset by one or more columns.

`df.sort_values(by='Marks', ascending=False)`

In [13]:
jawwad_ahmad_df.sort_values(by='Marks', ascending=False)

Unnamed: 0,Name,Age,Gender,Marks,City,EnrollmentDate
7,Aisha,24.0,Female,95.0,Peshawar,2024-03-10
3,Hina,23.0,Female,92.0,Islamabad,"March 5, 2024"
1,Sara,22.0,Female,90.0,Karachi,15/02/2024
12,Yusuf,21.0,Male,90.0,Multan,
14,Sara,22.0,Female,90.0,Karachi,15/02/2024
6,Bilal,22.0,Male,88.0,Multan,2024-02-28
13,Sana,22.0,Female,88.0,Karachi,28/02/2024
10,Ali,20.0,Male,85.0,Lahore,2024-01-15
4,Omar,20.0,Male,85.0,Karachi,2024-01-20
0,Ali,20.0,Male,85.0,Lahore,2024-01-15
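
Sorting by more than one column breaks ties in a predictable way. The sketch below, on hypothetical data, sorts by marks (highest first) and then by name, and keeps rows with missing marks at the end.

```python
import pandas as pd

# Hypothetical rows: two students tie on marks, one has no marks
df = pd.DataFrame({
    "Name": ["Omar", "Ali", "Zara", "Sara"],
    "Marks": [85.0, 85.0, None, 90.0],
})

ranked = df.sort_values(
    by=["Marks", "Name"],      # sort by Marks first, then Name to break ties
    ascending=[False, True],   # highest marks first, names A to Z
    na_position="last",        # rows with missing marks go to the bottom
)
print(ranked)
```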


## 13. `dropna()`: remove rows with missing values

**Full name:** Drop missing values

**Purpose:** Removes rows or columns that contain missing (NaN) values. By default it drops rows and returns a new DataFrame; `axis=1` drops columns instead.

**Meaning:** Cleans the dataset by deleting incomplete data.

`df.dropna()`

In [14]:
jawwad_ahmad_df.dropna()

Unnamed: 0,Name,Age,Gender,Marks,City,EnrollmentDate
0,Ali,20.0,Male,85.0,Lahore,2024-01-15
1,Sara,22.0,Female,90.0,Karachi,15/02/2024
2,Ahmed,21.0,Male,78.0,Lahore,2024-03-01
3,Hina,23.0,Female,92.0,Islamabad,"March 5, 2024"
4,Omar,20.0,Male,85.0,Karachi,2024-01-20
6,Bilal,22.0,Male,88.0,Multan,2024-02-28
7,Aisha,24.0,Female,95.0,Peshawar,2024-03-10
8,Noor,22.0,Female,76.0,Lahore,2024/03/12
9,Sameer,23.0,Male,85.0,Karachi,05-02-2024
10,Ali,20.0,Male,85.0,Lahore,2024-01-15
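
Dropping every incomplete row can discard more data than intended. As a sketch, the `subset` parameter limits the check to chosen columns; the three rows below are hypothetical.

```python
import pandas as pd

# Hypothetical rows: Zara has no marks, Fatima has no age
df = pd.DataFrame({
    "Name": ["Ali", "Zara", "Fatima"],
    "Age": [20.0, 21.0, None],
    "Marks": [85.0, None, 82.0],
})

# Drops any row with at least one missing value
complete_rows = df.dropna()

# Drops only rows where Marks is missing; a missing Age is tolerated
has_marks = df.dropna(subset=["Marks"])

print(complete_rows)
print(has_marks)
```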


## 14. `fillna()`: fill missing values with a specified value

**Full name:** Fill missing values

**Purpose:** Fills missing values with a specific value (e.g., 0, mean, or median).

**Meaning:** Keeps all data rows by filling empty cells instead of deleting them.

`df.fillna(0)`

In [15]:
jawwad_ahmad_df.fillna(0)

Unnamed: 0,Name,Age,Gender,Marks,City,EnrollmentDate
0,Ali,20.0,Male,85.0,Lahore,2024-01-15
1,Sara,22.0,Female,90.0,Karachi,15/02/2024
2,Ahmed,21.0,Male,78.0,Lahore,2024-03-01
3,Hina,23.0,Female,92.0,Islamabad,"March 5, 2024"
4,Omar,20.0,Male,85.0,Karachi,2024-01-20
5,Zara,21.0,Female,0.0,Lahore,
6,Bilal,22.0,Male,88.0,Multan,2024-02-28
7,Aisha,24.0,Female,95.0,Peshawar,2024-03-10
8,Noor,22.0,Female,76.0,Lahore,2024/03/12
9,Sameer,23.0,Male,85.0,Karachi,05-02-2024
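
Filling with 0 keeps every row but pulls averages down, so a per-column fill value is often a better fit. The sketch below passes a dictionary to `fillna()`; the data and the choice of median for `Age` and mean for `Marks` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical data with one missing value per column
df = pd.DataFrame({
    "Age": [20.0, 22.0, None, 24.0],
    "Marks": [80.0, None, 90.0, 100.0],
})

# A different fill value per column: median age, mean marks
filled = df.fillna({
    "Age": df["Age"].median(),
    "Marks": df["Marks"].mean(),
})
print(filled)
```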


## 15. `unique()`: unique values in a column

**Full name:** Unique values method

**Purpose:** Returns all unique/distinct values from a column.

**Meaning:** Shows all the different gender values in that column.

`df['Gender'].unique()`

In [16]:
jawwad_ahmad_df['Gender'].unique()

array(['Male', 'Female'], dtype=object)
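
A related method, `nunique()`, returns only the number of distinct values. The sketch below uses a hypothetical Series with one missing entry to show that `unique()` includes the missing value while `nunique()` skips it by default.

```python
import pandas as pd

# Hypothetical gender values with one missing entry
genders = pd.Series(["Male", "Female", "Male", None], name="Gender")

distinct = genders.unique()                   # includes the missing value
n_distinct = genders.nunique()                # missing value not counted
n_with_missing = genders.nunique(dropna=False)

print(distinct, n_distinct, n_with_missing)
```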

## 16. `apply(lambda x: ...)`: apply a function to each element in a column

**Full name:** Apply function with lambda expression

**Purpose:** Applies a custom function to each value in a column. Here, it converts city names to uppercase.

In [17]:
jawwad_ahmad_df['City'].apply(lambda x: x.upper())

0        LAHORE
1       KARACHI
2        LAHORE
3     ISLAMABAD
4       KARACHI
5        LAHORE
6        MULTAN
7      PESHAWAR
8        LAHORE
9       KARACHI
10       LAHORE
11    ISLAMABAD
12       MULTAN
13      KARACHI
14      KARACHI
15       LAHORE
Name: City, dtype: object
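
The lambda works here because `City` has no missing values. As a sketch of an alternative, the vectorized `.str.upper()` accessor gives the same result and leaves missing entries as missing instead of raising; the Series below is hypothetical.

```python
import pandas as pd

# Hypothetical city values, including a missing one
cities = pd.Series(["Lahore", "karachi", None], name="City")

# Vectorized string method: missing values stay missing
upper = cities.str.upper()
print(upper)

# The lambda version fails on the missing entry, which has no .upper() method
lambda_failed = False
try:
    cities.apply(lambda x: x.upper())
except AttributeError:
    lambda_failed = True
print(lambda_failed)
```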

## 17. `sample()`: random sample of rows

**Full name:** Random sample method

**Purpose:** Returns a random sample of rows from the dataset.

`df.sample(3)`

In [18]:
jawwad_ahmad_df.sample(3)

Unnamed: 0,Name,Age,Gender,Marks,City,EnrollmentDate
12,Yusuf,21.0,Male,90.0,Multan,
2,Ahmed,21.0,Male,78.0,Lahore,2024-03-01
13,Sana,22.0,Female,88.0,Karachi,28/02/2024
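
Each call to `sample()` normally returns different rows. A minimal sketch, on hypothetical data, of making the sample repeatable with `random_state` and of sampling a fraction with `frac`:

```python
import pandas as pd

# Hypothetical frame with 10 rows
df = pd.DataFrame({"Name": [f"student_{i}" for i in range(10)]})

# The same seed gives the same rows on every run
first_draw = df.sample(3, random_state=42)
second_draw = df.sample(3, random_state=42)
print(first_draw.equals(second_draw))

# frac selects a share of the rows instead of a fixed number
half = df.sample(frac=0.5, random_state=0)
print(len(half))
```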


## 18. `pivot_table()`: summary table

**Full name:** Pivot table method

**Purpose:** Summarizes data, similar to Excel pivot tables. You can calculate mean, sum, count, etc. grouped by another column.

**Meaning:** Shows the average marks for each city.

`df.pivot_table(values='Marks', index='City', aggfunc='mean')`

In [19]:
jawwad_ahmad_df.pivot_table(values='Marks', index='City', aggfunc='mean')

Unnamed: 0_level_0,Marks
City,Unnamed: 1_level_1
Islamabad,87.0
Karachi,87.6
Lahore,80.4
Multan,89.0
Peshawar,95.0
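
With only an `index`, the pivot table gives the same numbers as the `groupby` example. Adding `columns` spreads a second category across the table. This is a sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical marks by city and gender
df = pd.DataFrame({
    "City": ["Lahore", "Lahore", "Lahore", "Karachi", "Karachi"],
    "Gender": ["Male", "Male", "Female", "Male", "Female"],
    "Marks": [80.0, 90.0, 90.0, 70.0, 100.0],
})

# Rows are cities, columns are genders, cells hold the mean marks
table = df.pivot_table(values="Marks", index="City", columns="Gender", aggfunc="mean")
print(table)
```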



# 🧹 Preprocess Data

---

### 🔹 1. Handle Missing Data  
**Full Name:** Handle Missing Data using `fillna()` or `dropna()`  

**Purpose:** To fix missing (NaN) values in the dataset so that calculations and analysis can be done without errors.  

**Meaning:** Missing values are replaced with a meaningful value (like 0, mean, or median) or removed completely. Here, we fill missing marks with the **average marks** of all students.


In [None]:
# Step 1: Handle missing data

# Create a copy of the original DataFrame to keep original data safe
df_filled = jawwad_ahmad_df.copy()

# Replace (fill) all missing values in the 'Marks' column
# with the average (mean) of the 'Marks' column
df_filled['Marks'] = df_filled['Marks'].fillna(df_filled['Marks'].mean())

# Show the original DataFrame, before filling
print('\nBefore fillna (Marks):')
print(jawwad_ahmad_df)

# Show the copy after filling missing marks with the mean
print('\n\nAfter fillna (Marks):')
print(df_filled)

Before fillna (Marks):
      Name   Age  Gender  Marks       City EnrollmentDate
0      Ali  20.0    Male   85.0     Lahore     2024-01-15
1     Sara  22.0  Female   90.0    Karachi     15/02/2024
2    Ahmed  21.0    Male   78.0     Lahore     2024-03-01
3     Hina  23.0  Female   92.0  Islamabad  March 5, 2024
4     Omar  20.0    Male   85.0    Karachi     2024-01-20
5     Zara  21.0  Female    NaN     Lahore               
6    Bilal  22.0    Male   88.0     Multan     2024-02-28
7    Aisha  24.0  Female   95.0   Peshawar     2024-03-10
8     Noor  22.0  Female   76.0     Lahore     2024/03/12
9   Sameer  23.0    Male   85.0    Karachi     05-02-2024
10     Ali  20.0    Male   85.0     Lahore     2024-01-15
11  Fatima   NaN  Female   82.0  Islamabad     2024-03-01
12   Yusuf  21.0    Male   90.0     Multan           None
13    Sana  22.0  Female   88.0    Karachi     28/02/2024
14    Sara  22.0  Female   90.0    Karachi     15/02/2024
15   Ahmed  21.0    Male   78.0     Lahore     20

### 🔹 2. Remove Duplicates  
**Full Name:** Remove Duplicate Records using `drop_duplicates()`  

**Purpose:** To delete repeated copies of a row, keeping the first occurrence of each.  

**Meaning:** Ensures every record is unique so that repeated entries do not affect analysis or results.


In [100]:
# Step 2: Remove duplicate rows

# Show the DataFrame before removing duplicates
print('\nBefore drop_duplicates:')
print(df_filled)

# Remove duplicate rows, keeping the first occurrence of each
df_no_dup = df_filled.drop_duplicates()

# Show the DataFrame after removing duplicates
print('\nAfter drop_duplicates:')
print(df_no_dup)


Before drop_duplicates:
      Name   Age  Gender  Marks       City EnrollmentDate
0      Ali  20.0    Male   85.0     Lahore     2024-01-15
1     Sara  22.0  Female   90.0    Karachi     15/02/2024
2    Ahmed  21.0    Male   78.0     Lahore     2024-03-01
3     Hina  23.0  Female   92.0  Islamabad  March 5, 2024
4     Omar  20.0    Male   85.0    Karachi     2024-01-20
5     Zara  21.0  Female   85.8     Lahore               
6    Bilal  22.0    Male   88.0     Multan     2024-02-28
7    Aisha  24.0  Female   95.0   Peshawar     2024-03-10
8     Noor  22.0  Female   76.0     Lahore     2024/03/12
9   Sameer  23.0    Male   85.0    Karachi     05-02-2024
10     Ali  20.0    Male   85.0     Lahore     2024-01-15
11  Fatima   NaN  Female   82.0  Islamabad     2024-03-01
12   Yusuf  21.0    Male   90.0     Multan           None
13    Sana  22.0  Female   88.0    Karachi     28/02/2024
14    Sara  22.0  Female   90.0    Karachi     15/02/2024
15   Ahmed  21.0    Male   78.0     Lahore     
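
Before dropping anything, it can help to look at which rows are flagged. The sketch below uses `duplicated()` on hypothetical data, and also shows the `subset` and `keep` options of `drop_duplicates()`.

```python
import pandas as pd

# Hypothetical rows: row 2 repeats row 0 exactly; row 3 repeats only Sara's name
df = pd.DataFrame({
    "Name": ["Ali", "Sara", "Ali", "Sara"],
    "Marks": [85, 90, 85, 88],
})

# True for rows that repeat an earlier row in every column
is_repeat = df.duplicated()

# keep=False flags every member of a duplicate group, useful for inspection
all_copies = df[df.duplicated(keep=False)]

# Compare on Name only and keep the last occurrence of each name
latest_per_name = df.drop_duplicates(subset=["Name"], keep="last")

print(is_repeat.tolist())
print(all_copies)
print(latest_per_name)
```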

### 🔹 3. Convert Data Types  
**Full Name:** Convert Data Types using `pd.to_datetime()`  

**Purpose:** Converts text-based date columns into **datetime** objects, allowing date-based operations like filtering or sorting.  

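**Meaning:** This ensures the `EnrollmentDate` column is correctly understood as a date instead of simple text. Note that this cell converts the original `jawwad_ahmad_df`, so the `df_filled` and `df_no_dup` copies created earlier keep their text dates.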


In [101]:
# Step 3: Convert data types (example)

jawwad_ahmad_df['EnrollmentDate'] = pd.to_datetime(
    jawwad_ahmad_df['EnrollmentDate'],  # the column to convert
    dayfirst=True,         # treat first number as day (for formats like 15/02/2024)
    errors='coerce',       # turn invalid/missing dates into NaT instead of crashing
    format='mixed'         # allows Pandas to handle multiple date formats in one column
)

jawwad_ahmad_df

Unnamed: 0,Name,Age,Gender,Marks,City,EnrollmentDate
0,Ali,20.0,Male,85.0,Lahore,2024-01-15
1,Sara,22.0,Female,90.0,Karachi,2024-02-15
2,Ahmed,21.0,Male,78.0,Lahore,2024-03-01
3,Hina,23.0,Female,92.0,Islamabad,2024-03-05
4,Omar,20.0,Male,85.0,Karachi,2024-01-20
5,Zara,21.0,Female,,Lahore,NaT
6,Bilal,22.0,Male,88.0,Multan,2024-02-28
7,Aisha,24.0,Female,95.0,Peshawar,2024-03-10
8,Noor,22.0,Female,76.0,Lahore,2024-03-12
9,Sameer,23.0,Male,85.0,Karachi,2024-02-05
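
The `dayfirst` and `errors` arguments matter most for ambiguous or invalid strings. As a small sketch on single values (the invalid string is a made-up example):

```python
import pandas as pd

# "05-02-2024" is ambiguous: 5 February or 2 May?
day_first = pd.to_datetime("05-02-2024", dayfirst=True)     # read as day-month-year
month_first = pd.to_datetime("05-02-2024", dayfirst=False)  # read as month-day-year

# An unparseable string becomes NaT instead of raising an error
invalid = pd.to_datetime("not a date", errors="coerce")

print(day_first, month_first, invalid)
```

The converted row for Sameer above (`2024-02-05`) corresponds to the `dayfirst=True` reading.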



### 🔹 4. Encode Categories  
**Full Name:** Encode Categorical Variables using `pd.get_dummies()`  

**Purpose:** To convert **textual (categorical)** columns like "Gender" and "City" into **indicator** columns so machine learning algorithms can process them. Recent pandas versions display the indicators as `True`/`False`, which behave like 1 and 0 in calculations.  

**Meaning:** Converts each category into a separate column with binary indicators (1 for yes, 0 for no).
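
As a toy illustration of the idea before the full cell, the sketch below encodes a hypothetical three-row frame. It passes `dtype=int` to get 1/0 values rather than booleans, and shows what `drop_first=True` removes.

```python
import pandas as pd

# Hypothetical rows with a single categorical column
df = pd.DataFrame({
    "Name": ["Ali", "Sara", "Hina"],
    "Gender": ["Male", "Female", "Female"],
})

# One indicator column per category, stored as integers
encoded = pd.get_dummies(df, columns=["Gender"], dtype=int)
print(encoded)

# drop_first=True drops the first category (Female), which the other column implies
compact = pd.get_dummies(df, columns=["Gender"], dtype=int, drop_first=True)
print(compact)
```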


In [102]:
# Step 4: Encode categorical columns

# Convert categorical columns ('Gender' and 'City') into numeric columns
# Each unique category (e.g., Male/Female, Lahore/Karachi/Islamabad)
# becomes a separate column with 0 or 1 values

df_encoded = pd.get_dummies(
    df_no_dup,             # The DataFrame we want to encode
    columns=['Gender', 'City'],  # The categorical columns to encode
    drop_first=False       # Keeps all category columns (set to True to drop one)
)

# Print message for confirmation
print('\nAfter get_dummies:')

# Display encoded DataFrame
print(df_encoded)


After get_dummies:
      Name   Age  Marks EnrollmentDate  Gender_Female  Gender_Male  \
0      Ali  20.0   85.0     2024-01-15          False         True   
1     Sara  22.0   90.0     15/02/2024           True        False   
2    Ahmed  21.0   78.0     2024-03-01          False         True   
3     Hina  23.0   92.0  March 5, 2024           True        False   
4     Omar  20.0   85.0     2024-01-20          False         True   
5     Zara  21.0   85.8                          True        False   
6    Bilal  22.0   88.0     2024-02-28          False         True   
7    Aisha  24.0   95.0     2024-03-10           True        False   
8     Noor  22.0   76.0     2024/03/12           True        False   
9   Sameer  23.0   85.0     05-02-2024          False         True   
11  Fatima   NaN   82.0     2024-03-01           True        False   
12   Yusuf  21.0   90.0           None          False         True   
13    Sana  22.0   88.0     28/02/2024           True        False   
