1.1
In pandas, missing values are represented by None or NaN. We can identify them with isnull() and notnull(), and drop the rows that contain them with dropna().



In [16]:
import pandas as pd
import numpy as np
data = {'Name': ['Alice', 'Bob', 'Charlie', None, 'Emma'],
        'Age': [25, None, 30, 22, None],
        'City': ['New York', 'Los Angeles', None, 'Chicago', 'Boston']}

df = pd.DataFrame(data)
print(df)
print(df.isnull())   # Shows True where values are missing
print(df.notnull())  # Shows True where values are not missing
print(df.dropna())   # Drops rows that contain at least one missing value

      Name   Age         City
0    Alice  25.0     New York
1      Bob   NaN  Los Angeles
2  Charlie  30.0         None
3     None  22.0      Chicago
4     Emma   NaN       Boston
    Name    Age   City
0  False  False  False
1  False   True  False
2  False  False   True
3   True  False  False
4  False   True  False
    Name    Age   City
0   True   True   True
1   True  False   True
2   True   True  False
3  False   True   True
4   True  False   True
    Name   Age      City
0  Alice  25.0  New York
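
The boolean tables above are hard to read on larger data, so it can help to summarize them. The following is a minimal sketch that reuses the same sample data; the variable names `missing_per_column`, `rows_with_missing`, and `df_age_known` are illustrative only.

```python
import pandas as pd

# Same sample data as in the cell above
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', None, 'Emma'],
    'Age': [25, None, 30, 22, None],
    'City': ['New York', 'Los Angeles', None, 'Chicago', 'Boston'],
})

# Number of missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)

# Drop a row only when 'Age' is missing, rather than when any column is missing
df_age_known = df.dropna(subset=['Age'])
print(df_age_known)
```

Restricting dropna() with `subset` keeps more rows than the plain call, which left only the row for Alice.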


1.2
Imputation is the process of replacing missing or incomplete data with substituted values, such as a constant placeholder, the column mean, median, or mode, or a neighboring value.

In [17]:
# Fill every missing value with a constant placeholder
df_filled = df.fillna("Unknown")
print("DataFrame with missing values filled with 'Unknown':\n", df_filled)

# Create a copy of the DataFrame for numeric imputation
df_numeric = df.copy()

# Mean imputation
df_numeric['Age'] = df['Age'].fillna(df['Age'].mean())
print("\nDataFrame with 'Age' filled with mean:\n", df_numeric)

# Median imputation (computed from the original 'Age' column)
df_numeric['Age'] = df['Age'].fillna(df['Age'].median())
print("\nDataFrame with 'Age' filled with median:\n", df_numeric)

# Mode imputation; mode() returns a Series, so take its first value
df_numeric['Age'] = df['Age'].fillna(df['Age'].mode()[0])
print("\nDataFrame with 'Age' filled with mode:\n", df_numeric)




DataFrame with missing values filled with 'Unknown':
       Name      Age         City
0    Alice     25.0     New York
1      Bob  Unknown  Los Angeles
2  Charlie     30.0      Unknown
3  Unknown     22.0      Chicago
4     Emma  Unknown       Boston

DataFrame with 'Age' filled with mean:
       Name        Age         City
0    Alice  25.000000     New York
1      Bob  25.666667  Los Angeles
2  Charlie  30.000000         None
3     None  22.000000      Chicago
4     Emma  25.666667       Boston

DataFrame with 'Age' filled with median:
       Name   Age         City
0    Alice  25.0     New York
1      Bob  25.0  Los Angeles
2  Charlie  30.0         None
3     None  22.0      Chicago
4     Emma  25.0       Boston

DataFrame with 'Age' filled with mode:
       Name   Age         City
0    Alice  25.0     New York
1      Bob  22.0  Los Angeles
2  Charlie  30.0         None
3     None  22.0      Chicago
4     Emma  22.0       Boston


In [18]:
# Fill missing values using forward fill (previous value)
df_ffill = df.ffill()
print("\nDataFrame with forward fill:\n", df_ffill)

# Fill missing values using backward fill (next value)
df_bfill = df.bfill()
print("\nDataFrame with backward fill:\n", df_bfill)


DataFrame with forward fill:
       Name   Age         City
0    Alice  25.0     New York
1      Bob  25.0  Los Angeles
2  Charlie  30.0  Los Angeles
3  Charlie  22.0      Chicago
4     Emma  22.0       Boston

DataFrame with backward fill:
       Name   Age         City
0    Alice  25.0     New York
1      Bob  30.0  Los Angeles
2  Charlie  30.0      Chicago
3     Emma  22.0      Chicago
4     Emma   NaN       Boston
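
Filling every column with the string "Unknown" also puts text into the numeric Age column, as the first output above shows. One possible alternative, sketched here under the assumption that a placeholder suits the text columns and the median suits Age, is to pass fillna() a dictionary with one fill value per column. The names `fill_values` and `df_mixed` are illustrative.

```python
import pandas as pd

# Same sample data as in section 1.1
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', None, 'Emma'],
    'Age': [25, None, 30, 22, None],
    'City': ['New York', 'Los Angeles', None, 'Chicago', 'Boston'],
})

# One fill value per column: a placeholder for text, the median for 'Age'
fill_values = {
    'Name': 'Unknown',
    'Age': df['Age'].median(),
    'City': 'Unknown',
}
df_mixed = df.fillna(fill_values)

print(df_mixed)
print(df_mixed.dtypes)  # 'Age' stays numeric
```

Because Age receives a number rather than a string, the column remains usable for arithmetic afterwards.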


2.1
Categorical variables can be encoded to convert non-numeric data into numerical values. The most common methods are label encoding and one-hot encoding.

In [19]:
# Label encoding (assigns a unique integer to each category)
import pandas as pd
df = pd.DataFrame({'Fruit': ['Apple', 'Banana', 'Cherry', 'Apple']})
df['Fruit_encoded'] = df['Fruit'].astype('category').cat.codes
print(df)

    Fruit  Fruit_encoded
0   Apple              0
1  Banana              1
2  Cherry              2
3   Apple              0
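
To interpret the integer codes later, it is useful to keep the mapping between codes and labels. The sketch below shows one way to recover it from the categorical column; `fruit_cat`, `code_to_label`, and `decoded` are illustrative names, not part of the original cell.

```python
import pandas as pd

df = pd.DataFrame({'Fruit': ['Apple', 'Banana', 'Cherry', 'Apple']})

# Keep the categorical Series so the code-to-label mapping stays available
fruit_cat = df['Fruit'].astype('category')
df['Fruit_encoded'] = fruit_cat.cat.codes

# Categories are stored in sorted order; their position is the integer code
code_to_label = dict(enumerate(fruit_cat.cat.categories))
print(code_to_label)

# Decode the integers back into the original labels
decoded = df['Fruit_encoded'].map(code_to_label)
print(decoded)
```

Note that the codes follow the sorted order of the labels, so they imply an ordering (Apple < Banana < Cherry) that may carry no real meaning for the data.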


2.2
One-hot encoding is a technique used in data preprocessing to convert categorical variables into a numerical format that machine learning algorithms can use. Instead of representing a categorical value by assigning it a unique integer, one-hot encoding creates binary columns for each category.

In [20]:
# One-hot encoding (creates a separate binary column for each category)
import pandas as pd
df = pd.DataFrame({'Fruit': ['Apple', 'Banana', 'Cherry', 'Apple']})
df_encoded = pd.get_dummies(df, columns=['Fruit']).astype(int)
print(df_encoded)

   Fruit_Apple  Fruit_Banana  Fruit_Cherry
0            1             0             0
1            0             1             0
2            0             0             1
3            1             0             0
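
The three binary columns above are redundant in the sense that any one of them can be derived from the other two. As a hedged variation, get_dummies() accepts `drop_first=True` to omit the first category, which some linear models prefer; whether this is appropriate depends on the model. The name `df_reduced` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Fruit': ['Apple', 'Banana', 'Cherry', 'Apple']})

# drop_first=True omits the first category ('Apple');
# a row of all zeros then stands for 'Apple'
df_reduced = pd.get_dummies(df, columns=['Fruit'], drop_first=True, dtype=int)
print(df_reduced)
```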


3
The duplicated() method identifies duplicate rows in a pandas DataFrame: it returns True for every row that repeats an earlier one. The drop_duplicates() method removes those rows and keeps the first occurrence.

In [21]:
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Emma'],
        'Age': [25, 30, 35, 25, 28]}
df = pd.DataFrame(data)
print(df.duplicated())  # Returns True for duplicate rows
df_no_duplicates = df.drop_duplicates() # Drops duplicate rows
print(df_no_duplicates)


0    False
1    False
2    False
3     True
4    False
dtype: bool
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
4     Emma   28
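
Both methods compare whole rows and keep the first occurrence by default. The sketch below uses a small hypothetical dataset (not the one above) to show the `subset` and `keep` parameters, which change what counts as a duplicate and which copy survives.

```python
import pandas as pd

# Hypothetical data: 'Alice' appears three times, once with a different age
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'Alice'],
                   'Age': [25, 30, 25, 26]})

# keep=False marks every member of a duplicated group, not only the repeats
all_copies = df.duplicated(keep=False)
print(all_copies)

# subset=['Name'] compares the 'Name' column only
same_name = df.duplicated(subset=['Name'])
print(same_name)

# Keep the last row for each name instead of the first
last_per_name = df.drop_duplicates(subset=['Name'], keep='last')
print(last_per_name)
```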


4
Scaling the features makes it easier for algorithms to find the optimal solution. Min-Max scaling maps the values to a fixed range ([0, 1] by default), which is suitable when the data needs to be bounded. Z-score normalization (standardization) rescales the values to zero mean and unit standard deviation.

In [22]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
data = pd.DataFrame({
    'Student': ['A', 'B', 'C', 'D', 'E'],
    'Height (cm)': [150, 160, 170, 180, 190]
})

# Min-Max Scaling
min_max_scaler = MinMaxScaler()
data_minmax = data.copy()
data_minmax[['Height (cm)']] = min_max_scaler.fit_transform(data[['Height (cm)']])

# Z-Score Normalization
z_score_scaler = StandardScaler()
data_zscore = data.copy()
data_zscore[['Height (cm)']] = z_score_scaler.fit_transform(data[['Height (cm)']])

print("Initial data:")
print(data)
print("Min-Max Scaling:")
print(data_minmax)
print("Z-Score Normalization:")
print(data_zscore)


Initial data:
  Student  Height (cm)
0       A          150
1       B          160
2       C          170
3       D          180
4       E          190
Min-Max Scaling:
  Student  Height (cm)
0       A         0.00
1       B         0.25
2       C         0.50
3       D         0.75
4       E         1.00
Z-Score Normalization:
  Student  Height (cm)
0       A    -1.414214
1       B    -0.707107
2       C     0.000000
3       D     0.707107
4       E     1.414214
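
To make the two transformations less of a black box, the same numbers can be reproduced with plain pandas arithmetic. This is a sketch for illustration: `heights`, `min_max`, and `z_score` are names introduced here, and `ddof=0` is used because StandardScaler divides by the population standard deviation.

```python
import pandas as pd

heights = pd.Series([150, 160, 170, 180, 190])

# Min-Max scaling: (x - min) / (max - min), result lies in [0, 1]
min_max = (heights - heights.min()) / (heights.max() - heights.min())
print(min_max)

# Z-score: (x - mean) / std, using the population standard deviation (ddof=0)
z_score = (heights - heights.mean()) / heights.std(ddof=0)
print(z_score)
```

Note that pandas' std() defaults to the sample standard deviation (`ddof=1`), which would give slightly smaller absolute Z-scores than the scaler output above.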


An outlier is a data point that differs noticeably from the rest of the data. Outliers may result from measurement errors or faulty data collection, or they may reflect variables that were not considered when the data was collected.
The Z-score method standardizes the data and measures how far each point lies from the mean in units of standard deviation. Points whose absolute Z-score exceeds a threshold (3 in the code below) are flagged as outliers.
The IQR method computes the interquartile range, that is, the distance between the first quartile (25th percentile) and the third quartile (75th percentile). Points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are flagged as outliers.


In [23]:
import numpy as np
import pandas as pd
from scipy import stats
# Sample dataset with outliers
weights = pd.Series([70, 72, 68, 75, 69, 71, 80, 85, 90, 95, 200])


# Detecting outliers using IQR
Q1 = weights.quantile(0.25)
Q3 = weights.quantile(0.75)

IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers_IQR = weights[(weights < lower_bound) | (weights > upper_bound)]
print("\nOutliers IQR:")
print(outliers_IQR)

# Detecting outliers using Z-score (|z| > 3)
z_scores = stats.zscore(weights)

outliers_zscore = weights[np.abs(z_scores) > 3]
print("\nOutliers Z-score:")
print(outliers_zscore)

# Remove outliers (IQR method)
weights_no_outliers_IQR = weights[(weights >= lower_bound) & (weights <= upper_bound)]
print("\nWeights without outliers (IQR method):")
print(weights_no_outliers_IQR)

# Remove outliers (Z-score method)
weights_no_outliers_zscore = weights[np.abs(z_scores) <= 3]
print("\nWeights without outliers (Z-score method):")
print(weights_no_outliers_zscore)


Outliers IQR:
10    200
dtype: int64

Outliers Z-score:
10    200
dtype: int64

Weights without outliers (IQR method):
0    70
1    72
2    68
3    75
4    69
5    71
6    80
7    85
8    90
9    95
dtype: int64

Weights without outliers (Z-score method):
0    70
1    72
2    68
3    75
4    69
5    71
6    80
7    85
8    90
9    95
dtype: int64
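
The cell above prints only the flagged values, so the sketch below exposes the intermediate numbers for the same weights. The helper `iqr_bounds` and the variables `z_population` and `z_sample` are hypothetical names introduced for illustration. It also compares the two common definitions of standard deviation, since scipy's zscore() uses `ddof=0` by default while pandas' std() uses `ddof=1`.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) outlier boundaries of the IQR method."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

weights = pd.Series([70, 72, 68, 75, 69, 71, 80, 85, 90, 95, 200])

lower, upper = iqr_bounds(weights)
print(lower, upper)  # 45.0 113.0

# Z-scores with the population standard deviation (ddof=0, as scipy does)
z_population = (weights - weights.mean()) / weights.std(ddof=0)
# Z-scores with the sample standard deviation (ddof=1, the pandas default)
z_sample = (weights - weights.mean()) / weights.std(ddof=1)

print(z_population.abs().max())  # about 3.07, above the threshold of 3
print(z_sample.abs().max())      # about 2.93, below the threshold of 3
```

On a sample this small the value 200 only narrowly exceeds the threshold, and with the sample standard deviation it would not be flagged at all, so the Z-score result here is sensitive to that choice.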
