# 1. Handling Missing Data Questions:

**How do you identify and handle missing values in a Pandas DataFrame?**

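Use isnull() (or its alias isna()) to flag missing values; df.isnull().sum() counts them per column.

To handle them:

Remove rows that contain missing values: df.dropna()

Fill missing values with a constant: df.fillna(0)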
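
As a minimal sketch of the two options above, the following uses a small made-up DataFrame (the column names and values are illustrative assumptions, not data from these notes). Called this way, dropna() and fillna() both return new DataFrames, so the original df is left unchanged.

```python
import pandas as pd

# Hypothetical toy data with one missing value per column.
df = pd.DataFrame({
    'Age': [25, None, 30, 22],
    'City': ['New York', 'Los Angeles', None, 'Chicago'],
})

# Count the missing values in each column.
missing_per_column = df.isnull().sum()

# Option 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: keep all rows and fill the gaps with one constant per column.
filled = df.fillna({'Age': 0, 'City': 'Unknown'})

print(missing_per_column)
print(dropped)
print(filled)
```

Dropping is simplest but discards the whole row, including its valid cells; filling keeps every row at the cost of introducing a placeholder value.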

**What is imputation, and why is it useful?**

Imputation involves replacing missing values with estimated values, such as the mean, median, or mode of the column. It is useful because it preserves the dataset size and avoids the loss of valuable information that occurs when rows are dropped.
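
A short sketch of median and mode imputation, assuming a toy DataFrame invented for illustration. The median is used for the numeric column because it is less sensitive to extreme values (the 80 here) than the mean, and the mode is used for the text column.

```python
import pandas as pd

# Hypothetical toy data: one missing age and one missing city.
df = pd.DataFrame({
    'Age': [25, None, 30, 22, 80],
    'City': ['Chicago', 'Chicago', None, 'New York', 'Chicago'],
})

# Estimates computed from the observed (non-missing) values only.
age_mean = df['Age'].mean()
age_median = df['Age'].median()
city_mode = df['City'].mode()[0]  # mode() returns a Series; take the first entry

# Impute on a copy so the original data stays available for comparison.
imputed = df.copy()
imputed['Age'] = imputed['Age'].fillna(age_median)
imputed['City'] = imputed['City'].fillna(city_mode)

print('mean:', age_mean, 'median:', age_median, 'mode:', city_mode)
print(imputed)
```

All five rows are kept, which is the point made above: imputation preserves the dataset size that dropna() would reduce.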

In [3]:
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Age': [25, None, 30, 22],
    'City': ['New York', 'Los Angeles', None, 'Chicago']
}

df = pd.DataFrame(data)
print(df.isnull().sum())

Name    1
Age     1
City    1
dtype: int64


# 2. Data Transformation Questions:

**How can you encode categorical variables in a Pandas DataFrame?**

Use label encoding to map each category to an integer code.

Use one-hot encoding to create a separate 0/1 column for each category.

**What is one-hot encoding, and when do you use it?**

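One-hot encoding creates one binary (0/1) column per category; in each row, only the column for that row's category is 1.

Use it when the categories have no natural order, like colors (red, blue, green), so that the model does not read a ranking into integer codes.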
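
The sketch below contrasts the two cases on made-up data (the Color and Size columns and the size ordering are assumptions for illustration). Unordered colors are one-hot encoded with pd.get_dummies, passing dtype=int to get 0/1 values instead of booleans; sizes, which do have an order, are mapped to integer codes in an explicitly stated order.

```python
import pandas as pd

# Hypothetical toy data: Color has no order, Size does.
df = pd.DataFrame({
    'Color': ['red', 'blue', 'green', 'blue'],
    'Size': ['small', 'large', 'medium', 'small'],
})

# One-hot encoding: one 0/1 column per color, columns in sorted order.
color_dummies = pd.get_dummies(df['Color'], prefix='Color', dtype=int)

# Ordered encoding: codes follow the order given here, not alphabetical order.
size_order = ['small', 'medium', 'large']
df['Size_code'] = pd.Categorical(df['Size'], categories=size_order, ordered=True).codes

print(color_dummies)
print(df)
```

Each row of color_dummies contains exactly one 1, while Size_code keeps the small < medium < large relationship that a one-hot encoding would discard.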


In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = {
    'Fruit': ['Apple', 'Banana', 'Orange', 'Apple', 'Banana'],
    'Price': [100, 50, 80, 110, 55]
}
df = pd.DataFrame(data)

# LabelEncoder assigns integer codes in sorted (alphabetical) order of the labels
le = LabelEncoder()
df['encoded'] = le.fit_transform(df['Fruit'])
print(df)


    Fruit  Price  encoded
0   Apple    100        0
1  Banana     50        1
2  Orange     80        2
3   Apple    110        0
4  Banana     55        1


In [10]:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

encoded_df = pd.get_dummies(df['City'])
print(encoded_df)

   Chicago  Los Angeles  New York
0    False        False      True
1    False         True     False
2     True        False     False


# 3. Removing Duplicates Questions:

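**How do you find and remove duplicate rows?**

Call df.duplicated() to flag duplicate rows (add .sum() to count them) and df.drop_duplicates() to remove them, as in the example cell in this section.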

**Can you explain the difference between the duplicated() and drop_duplicates() methods in Pandas?**

duplicated() returns a Boolean Series indicating whether each row is a duplicate of a previous row.

drop_duplicates() returns a DataFrame with the duplicate rows removed. By default it keeps the first occurrence; the keep parameter ('first', 'last', or False) changes that.
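
A small sketch of how the two methods behave, using a made-up DataFrame (names and cities are illustrative assumptions). It also shows the keep and subset parameters, which control which copy counts as the duplicate and which columns are compared.

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 0 exactly;
# row 3 repeats row 1 only in the Name column.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Bob'],
    'City': ['Chicago', 'Boston', 'Chicago', 'Denver'],
})

# True for each row that repeats an earlier row (first occurrence stays False).
dup_mask = df.duplicated()

# keep=False marks every copy of a duplicated row, including the first.
all_copies = df.duplicated(keep=False)

# Drop full-row duplicates, keeping the first occurrence.
deduped = df.drop_duplicates()

# Compare on Name only and keep the last row for each name.
last_per_name = df.drop_duplicates(subset=['Name'], keep='last')

print(dup_mask.tolist())
print(all_copies.tolist())
print(deduped)
print(last_per_name)
```

Because duplicated() only returns a mask, it is useful for inspecting or counting duplicates before deciding whether to drop them.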



In [None]:
# Count the duplicate rows:
print(df.duplicated().sum())

# Remove the duplicate rows in place:
df.drop_duplicates(inplace=True)


# 4. Data Scaling and Normalization Questions:

**Discuss the importance of feature scaling in machine learning.**

Feature scaling puts all features on a comparable scale, so that features with large numeric ranges do not dominate the model. It speeds up convergence for models trained with gradient descent (for example, linear regression fitted that way) and helps distance-based algorithms like KNN and K-means, whose distance calculations would otherwise be dominated by the largest-scale feature.
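
To make the two common scaling formulas concrete, here is a minimal sketch that computes them by hand with pandas on a made-up column (the name 'feature' and the values are assumptions). Min-max scaling is (x - min) / (max - min); standardization is (x - mean) / std. The population standard deviation (ddof=0) is used here because that is what scikit-learn's StandardScaler uses, whereas pandas' std() defaults to ddof=1.

```python
import pandas as pd

# Hypothetical toy column.
df = pd.DataFrame({'feature': [10.0, 20.0, 30.0, 40.0, 50.0]})
x = df['feature']

# Min-max scaling: smallest value maps to 0, largest to 1.
min_max = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): mean 0, population standard deviation 1.
z = (x - x.mean()) / x.std(ddof=0)

print(min_max.tolist())
print(z.round(3).tolist())
```

Min-max scaling fixes the range but is sensitive to extreme values, since a single large maximum compresses everything else toward 0; standardization does not bound the range but keeps the spread of the bulk of the data comparable across features.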

In [None]:
# 'feature' is a placeholder column name.
# Min-max scaling: rescales each feature to the range [0, 1]
from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler()
minmax_scaled = minmax_scaler.fit_transform(df[['feature']])

# Z-score normalization (standardization): centers each feature at 0 with a standard deviation of 1
from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler()
standardized = standard_scaler.fit_transform(df[['feature']])

# 5. Handling Outliers Questions:

**What are outliers, and why might they impact machine learning models?**

Outliers are data points significantly different from other observations. They can skew statistical metrics (mean, standard deviation) and negatively affect model performance, especially in linear models and clustering algorithms.
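
A quick illustration of that skew, on made-up numbers: adding one extreme value to five similar values moves the mean and standard deviation a long way, while the median barely changes.

```python
import pandas as pd

# Hypothetical measurements clustered around 50.
values = pd.Series([48, 50, 51, 49, 52])

# The same data with a single extreme value appended.
with_outlier = pd.concat([values, pd.Series([500])], ignore_index=True)

print('mean:  ', values.mean(), '->', with_outlier.mean())
print('median:', values.median(), '->', with_outlier.median())
print('std:   ', round(values.std(), 2), '->', round(with_outlier.std(), 2))
```

This is why median-based statistics are often preferred for describing data that may contain outliers.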

**How can you handle outliers in a continuous numerical variable in Python?**

Remove them: Use IQR or Z-score methods to filter out extreme values.

Cap or floor them: Replace extreme values with upper or lower percentile values.

Transform them: Apply logarithmic or Box-Cox transformations to reduce skewness.

Impute them: Replace outliers with the median or mean of the distribution.
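
The four options above can be sketched on a made-up series as follows. This is only an illustration under stated assumptions: the data is invented, and the capping limits are the IQR fences (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR) rather than fixed percentiles; percentile limits such as s.quantile(0.95) can be passed to clip() in the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one extreme value.
s = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 13.0, 12.0, 300.0])

# IQR fences used as the limits for "extreme".
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
in_range = s.between(lower, upper)

removed = s[in_range]                      # remove: keep only in-range values
capped = s.clip(lower=lower, upper=upper)  # cap/floor: pull extremes to the limits
logged = np.log1p(s)                       # transform: log(1 + x) compresses large values
imputed = s.where(in_range, s.median())    # impute: replace extremes with the median

print(lower, upper)
print(removed.tolist())
print(capped.tolist())
print(imputed.tolist())
```

Removal shortens the series, while capping, transforming, and imputing all keep its length, which matters when the rows carry other columns worth keeping.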


In [None]:
# Methods for detecting outliers in a dataset in Python:
# Visualization methods: box plots, scatter plots, and histograms.
# Statistical methods:
# Z-score: values with a Z-score > 3 or < -3 are potential outliers.

import numpy as np
z_scores = np.abs((df['feature'] - df['feature'].mean()) / df['feature'].std())
outliers = df[z_scores > 3]

# IQR (interquartile range): flags values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR as outliers.

Q1 = df['feature'].quantile(0.25)
Q3 = df['feature'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['feature'] < (Q1 - 1.5 * IQR)) | (df['feature'] > (Q3 + 1.5 * IQR))]
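
The two detection rules can disagree, especially on small samples. The sketch below runs both on a made-up ten-value column (the data and the column name 'feature' are assumptions). With the sample standard deviation, the largest possible |z| among n values is (n - 1) / sqrt(n), about 2.85 for n = 10, so the z-score rule cannot flag anything here, while the IQR rule does.

```python
import pandas as pd

# Hypothetical toy data: nine similar values and one extreme value.
df = pd.DataFrame({
    'feature': [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 13.0, 12.0, 300.0]
})
x = df['feature']

# Z-score rule: flag values more than 3 standard deviations from the mean.
z = (x - x.mean()) / x.std()
z_outliers = df[z.abs() > 3]

# IQR rule: flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
iqr_outliers = df[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

print('largest |z|:', round(z.abs().max(), 3))
print('z-score outliers:', len(z_outliers))
print('IQR outliers:', iqr_outliers['feature'].tolist())
```

The outlier itself inflates the mean and standard deviation that the z-score depends on, which is one reason the quartile-based IQR rule is often the safer default on small or heavily skewed data.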


# WORK WITH CSV

In [None]:

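import pandas as pd

df = pd.read_csv("data.csv")

# Count missing values per column
print(df.isnull().sum())

# Drop rows that contain missing values
df.dropna(inplace=True)

# Alternative to dropping (has no effect once the rows are already dropped):
# fill gaps in numeric columns with each column's median
# df.fillna(df.median(numeric_only=True), inplace=True)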




Duration    0
Pulse       0
Maxpulse    0
Calories    5
dtype: int64


In [None]:
df['Category'] = df['Pulse'].apply(lambda x: 'High' if x > 120 else 'Low')
df = pd.get_dummies(df, columns=['Category'])
df





Unnamed: 0,Duration,Pulse,Maxpulse,Calories,Category_High,Category_Low,Category_High.1,Category_Low.1,Category_High.2,Category_Low.2
0,60,110,130,409.1,False,True,False,True,False,True
1,60,117,145,479.0,False,True,False,True,False,True
2,60,103,135,340.0,False,True,False,True,False,True
3,45,109,175,282.4,False,True,False,True,False,True
4,45,117,148,406.0,False,True,False,True,False,True
...,...,...,...,...,...,...,...,...,...,...
164,60,105,140,290.8,False,True,False,True,False,True
165,60,110,145,300.0,False,True,False,True,False,True
166,60,115,145,310.2,False,True,False,True,False,True
167,75,120,150,320.4,False,True,False,True,False,True


In [76]:
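# Count duplicate rows, drop them, then confirm none remain
print(df.duplicated().sum())
df = df.drop_duplicates()
print(df.duplicated().sum())

# Keep only the first of any columns that share a name
# (e.g. dummy columns repeated by re-running the encoding cell)
df = df.loc[:, ~df.columns.duplicated()]

print(df.head())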

0
0
   Duration  Pulse  Maxpulse  Calories  Category_High  Category_Low
0        60    110       130     409.1          False          True
1        60    117       145     479.0          False          True
2        60    103       135     340.0          False          True
3        45    109       175     282.4          False          True
4        45    117       148     406.0          False          True


In [78]:
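from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Min-max scaling: every column rescaled to [0, 1]
scaler_minmax = MinMaxScaler()
df_scaled = pd.DataFrame(scaler_minmax.fit_transform(df), columns=df.columns)

# Standardization: every column centered at 0 with unit standard deviation
scaler_zscore = StandardScaler()
df_normalized = pd.DataFrame(scaler_zscore.fit_transform(df), columns=df.columns)

# df itself is unchanged; the scaled copies are df_scaled and df_normalized
df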


Unnamed: 0,Duration,Pulse,Maxpulse,Calories,Category_High,Category_Low
0,60,110,130,409.1,False,True
1,60,117,145,479.0,False,True
2,60,103,135,340.0,False,True
3,45,109,175,282.4,False,True
4,45,117,148,406.0,False,True
...,...,...,...,...,...,...
164,60,105,140,290.8,False,True
165,60,110,145,300.0,False,True
166,60,115,145,310.2,False,True
167,75,120,150,320.4,False,True


In [79]:
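from scipy import stats
import numpy as np

# Absolute z-score of every value in the numeric columns
z_scores = np.abs(stats.zscore(df.select_dtypes(include=[np.number])))

# Total number of values more than 3 standard deviations from their column mean
outliers = (z_scores > 3).sum()
print(outliers)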


8
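
The cell above prints a single total across all numeric columns. As a hedged sketch of how the same check could be broken down per column and traced back to rows, the following uses pandas only on a made-up DataFrame (columns 'a' and 'b' and their values are assumptions). It uses ddof=0 to match the default of scipy.stats.zscore.

```python
import pandas as pd

# Hypothetical toy data: column 'a' has one extreme value, column 'b' has none.
df = pd.DataFrame({
    'a': [1.0] * 20 + [100.0],
    'b': [float(i) for i in range(21)],
})

# Column-wise z-scores with the population standard deviation (ddof=0).
numeric = df.select_dtypes(include='number')
z = (numeric - numeric.mean()) / numeric.std(ddof=0)
is_outlier = z.abs() > 3

per_column = is_outlier.sum()             # number of flagged values in each column
total = int(per_column.sum())             # single total across all columns
outlier_rows = df[is_outlier.any(axis=1)] # rows with at least one flagged value

print(per_column)
print(total)
print(outlier_rows)
```

The per-column counts show which feature the outliers come from, which a single total hides.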
