In [None]:
import pandas as pd
import numpy as np
from scipy.stats import zscore

1. Handling Missing Data Questions:
   Handling missing values in a Pandas DataFrame is a common task in data preprocessing. Missing values can significantly affect the results of your data analysis or machine learning models, so it is crucial to identify and handle them appropriately. Here are some steps and techniques to manage missing values in Pandas:
   isnull() or isna() methods: isnull() is an alias of isna(); both return a boolean mask of the same shape as the input, with True wherever an element is missing.
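
As a minimal sketch of how such a mask can be used, the snippet below builds a small hypothetical frame (the name `toy` and its values are made up for illustration and are not taken from tested.csv) and derives per-column counts, per-column shares, and the affected rows from the mask.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the Titanic file) with a few missing entries.
toy = pd.DataFrame({
    "Age": [34.5, np.nan, 62.0, np.nan],
    "Cabin": [None, "C105", "B45", None],
})

mask = toy.isna()                      # boolean DataFrame, True where a value is missing
missing_counts = mask.sum()            # number of missing values per column
missing_share = mask.mean()            # fraction of missing values per column
rows_with_any = toy[mask.any(axis=1)]  # rows with at least one missing value

print(missing_counts)
print(missing_share)
print(rows_with_any)
```

Summing the mask counts missing values because True is treated as 1; taking the mean gives the fraction, which is often easier to compare across columns.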

In [30]:
data = pd.read_csv('tested.csv')
data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,0,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
414,1306,1,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
415,1307,0,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,0,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


In [32]:
missing_values_new = data.isnull().sum()

missing_values_new

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [34]:
age_not_null = data[data['Age'].notnull()]

age_not_null.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


Imputation is the process of replacing missing data with substituted values. It is useful because it lets you analyze the full dataset instead of discarding every row that has a missing value, which preserves the sample size and the information held in the other columns. Simple strategies such as filling with the mean keep the column mean unchanged but reduce its variance, so the fill value should be chosen with the data and the downstream model in mind.

In [35]:
age_mean = data['Age'].mean()

# Assign the result back to the column; calling fillna(..., inplace=True) on
# data['Age'] operates on an intermediate object and triggers a FutureWarning.
data['Age'] = data['Age'].fillna(age_mean)
missing_values_age_after_imputation = data['Age'].isnull().sum()

age_mean, missing_values_age_after_imputation

(30.272590361445783, 0)
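
The mean is only one possible fill value. The sketch below, on a small hypothetical column (the `toy` frame and its values are assumptions for illustration), compares mean imputation, median imputation, and simply dropping the incomplete rows. It uses the assignment form `df[col] = df[col].fillna(value)` rather than `inplace=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value and one unusually large value.
toy = pd.DataFrame({"Age": [20.0, 22.0, np.nan, 24.0, 80.0]})

age_mean = toy["Age"].mean()      # 36.5, pulled upward by the value 80
age_median = toy["Age"].median()  # 23.0, less affected by the extreme value

# Fill into new columns so the original column stays available for comparison.
toy["Age_mean_filled"] = toy["Age"].fillna(age_mean)
toy["Age_median_filled"] = toy["Age"].fillna(age_median)

# Alternative: drop rows where Age is missing instead of imputing.
toy_dropped = toy.dropna(subset=["Age"])

print(toy)
print(len(toy_dropped), "rows remain after dropping")
```

When a column is skewed, the median is often the more representative fill value; dropping rows is the simplest option but shrinks the dataset.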

2. Data Transformation Questions:

How can you encode categorical variables in a Pandas DataFrame?
What is one-hot encoding, and when would you use it in data preprocessing?

One-hot encoding is a method to convert categorical variables into a binary (0 or 1) matrix in a Pandas DataFrame. It's used to ensure machine learning models treat categories without assuming an order. You would use it in data preprocessing when dealing with nominal categorical data, to transform it into a format suitable for machine learning algorithms that require numerical input. Pandas provides the get_dummies() function for one-hot encoding.

In [37]:
embarked_encoded = pd.get_dummies(data['Embarked'], prefix='Embarked')

embarked_encoded.head()

Unnamed: 0,Embarked_C,Embarked_Q,Embarked_S
0,False,True,False
1,False,False,True
2,False,True,False
3,False,False,True
4,False,False,True
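
Two optional arguments of `get_dummies()` are often useful here. The sketch below, on a hypothetical four-row column (the `toy` frame is an assumption for illustration), uses `dtype=int` to get 0/1 columns instead of booleans and `drop_first=True` to drop one redundant category, which some linear models prefer.

```python
import pandas as pd

# Hypothetical embarkation codes.
toy = pd.DataFrame({"Embarked": ["Q", "S", "C", "S"]})

# One indicator column per category, stored as integers 0/1.
full = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)

# Drop the first category (C): it is implied when all remaining columns are 0.
reduced = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int, drop_first=True)

# Attach the encoded columns to the original frame.
joined = toy.join(full)

print(full)
print(reduced)
```

Each row of `full` contains exactly one 1, which is where the name "one-hot" comes from.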


3. Removing Duplicates Questions:

How do you identify and remove duplicate rows from a DataFrame?
Can you explain the difference between the duplicated() and drop_duplicates() methods in Pandas?

To identify duplicate rows in a DataFrame, use duplicated(), which returns a boolean Series that is True for each row that repeats an earlier row.
To remove duplicates, use drop_duplicates(), which returns a new DataFrame with the duplicate rows removed.
The main difference is that duplicated() only flags duplicates, while drop_duplicates() removes them. Both accept a keep argument: keep='first' (the default) retains the first occurrence, keep='last' retains the last, and keep=False treats every occurrence as a duplicate.
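
The effect of the `keep` options, and of restricting the comparison to certain columns with `subset`, can be sketched on a small hypothetical frame (the `toy` frame and its column names are made up for illustration):

```python
import pandas as pd

# Hypothetical frame in which row 2 repeats row 0.
toy = pd.DataFrame({
    "id": [1, 2, 1, 3],
    "name": ["a", "b", "a", "c"],
})

first_mask = toy.duplicated()            # default keep='first': only the repeat is flagged
all_mask = toy.duplicated(keep=False)    # every member of a duplicate group is flagged

keep_last = toy.drop_duplicates(keep="last")    # keeps the last occurrence
drop_all = toy.drop_duplicates(keep=False)      # removes all occurrences
by_id = toy.drop_duplicates(subset=["id"])      # compares the 'id' column only

print(first_mask.tolist())  # [False, False, True, False]
print(all_mask.tolist())    # [True, False, True, False]
```

Note that `drop_duplicates()` keeps the original index labels; pass `ignore_index=True` if a fresh 0..n-1 index is wanted.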






In [8]:
data = pd.read_csv('tested.csv')

# Append a copy of the first row to the end of the DataFrame to create a known duplicate
data_with_duplicate = pd.concat([data, data.iloc[[0]]], ignore_index=True)

# Identify the duplicates, then remove them
duplicates = data_with_duplicate.duplicated()
print("Identified duplicates:\n", duplicates.tail())

data_no_duplicates = data_with_duplicate.drop_duplicates()
print("\nAfter removing duplicates, any duplicates left:", data_no_duplicates.duplicated().any())

# Displaying the last few rows to verify
data_no_duplicates.tail()

Identified duplicates:
 414    False
415    False
416    False
417    False
418     True
dtype: bool

After removing duplicates, any duplicates left: False


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
413,1305,0,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.05,,S
414,1306,1,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9,C105,C
415,1307,0,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.25,,S
416,1308,0,3,"Ware, Mr. Frederick",male,,0,0,359309,8.05,,S
417,1309,0,3,"Peter, Master. Michael J",male,,1,1,2668,22.3583,,C


4. Data Scaling and Normalization Questions:

Discuss the importance of feature scaling in machine learning.
Explain the difference between min-max scaling and z-score normalization.

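Feature scaling puts features on comparable numeric ranges so that a feature with large values does not dominate the others. This matters most for distance-based and gradient-based algorithms (for example k-nearest neighbors, support vector machines, and models trained with gradient descent), where it often improves both convergence speed and accuracy. Min-max scaling maps a feature to a fixed range, typically [0, 1], which suits algorithms that expect bounded inputs; because it depends on the minimum and maximum, it is sensitive to outliers. Z-score normalization (standardization) rescales a feature to a mean of 0 and a standard deviation of 1. It does not make the data normally distributed, but it is a common choice for algorithms that work best with centered features of similar variance. The choice between them depends on your data and the model's requirements.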

In [7]:
data = pd.read_csv('tested.csv')

# Manual Min-Max Scaling (min() and max() skip NaN; missing ages stay NaN in the result)
age_min = data['Age'].min()
age_max = data['Age'].max()
age_min_max_scaled_manual = (data['Age'] - age_min) / (age_max - age_min)

# Manual Z-Score Normalization
age_mean = data['Age'].mean()
age_std = data['Age'].std()
age_z_score_scaled_manual = (data['Age'] - age_mean) / age_std

# Creating a DataFrame for comparison
age_comparison_manual = pd.DataFrame({
    'Original Age': data['Age'],
    'Min-Max Scaled Age (Manual)': age_min_max_scaled_manual,
    'Z-Score Normalized Age (Manual)': age_z_score_scaled_manual
})

age_comparison_manual.head()

Unnamed: 0,Original Age,Min-Max Scaled Age (Manual),Z-Score Normalized Age (Manual)
0,34.5,0.452723,0.298099
1,47.0,0.617566,1.179547
2,62.0,0.815377,2.237285
3,27.0,0.353818,-0.230769
4,22.0,0.287881,-0.583349
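
A quick way to sanity-check both transformations is to confirm the properties they are supposed to have. The sketch below does this on a small hypothetical series (the values in `age` are assumptions for illustration). It also shows that pandas' `std()` uses the sample standard deviation (`ddof=1`) by default, whereas some other tools default to the population version (`ddof=0`), so z-scores can differ slightly between libraries.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value.
age = pd.Series([34.5, 47.0, np.nan, 27.0, 22.0], name="Age")

min_max = (age - age.min()) / (age.max() - age.min())
z_sample = (age - age.mean()) / age.std()            # pandas default: ddof=1
z_population = (age - age.mean()) / age.std(ddof=0)  # population standard deviation

print(min_max.min(), min_max.max())      # 0.0 1.0
print(z_sample.mean(), z_sample.std())   # approximately 0 and 1
print(min_max.isna().sum())              # the missing value is still missing
```

Neither transformation fills missing values; they pass through as NaN and still need to be handled separately.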


5. Handling Outliers Questions:

What are outliers, and why might they impact machine learning models?
Describe different methods for detecting outliers in a dataset in Python.
How can you handle outliers in a continuous numerical variable in Python?

1-Outliers are data points that differ significantly from the other observations in a dataset. They can result from genuine variability in what is being measured or from measurement and data-entry errors. They matter for machine learning because statistics such as the mean and standard deviation, and models fitted by minimizing squared error, can be pulled strongly toward a few extreme values.
2-Methods for detecting outliers in a dataset in Python:
Standard Deviation Method: If the data is approximately Gaussian, outliers can be defined as values that lie more than 3 standard deviations from the mean.

Interquartile Range (IQR) Method: The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the data. An outlier is any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.

Boxplots: A visual method in which points lying outside the whiskers of the boxplot are considered outliers.
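
The standard deviation and IQR rules above can be sketched in a few lines of pandas. The series `fare` below is hypothetical (19 evenly spaced values plus one extreme value, chosen for illustration), and the thresholds 3 and 1.5 are the conventional defaults rather than fixed requirements.

```python
import pandas as pd

# Hypothetical fares: 8.0, 8.5, ..., 17.0 plus one extreme value.
fare = pd.Series([8.0 + 0.5 * i for i in range(19)] + [500.0], name="Fare")

# Standard deviation method: flag values more than 3 standard deviations from the mean.
z = (fare - fare.mean()) / fare.std()
z_outliers = fare[z.abs() > 3]

# IQR method: flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
q1, q3 = fare.quantile(0.25), fare.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = fare[(fare < lower) | (fare > upper)]

print(z_outliers.tolist())    # [500.0]
print(iqr_outliers.tolist())  # [500.0]
```

In very small samples a single extreme value inflates the standard deviation so much that it may never exceed the 3-standard-deviation threshold; the IQR rule is less affected because quartiles are not sensitive to extreme values.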
3-Methods for handling outliers in a continuous numerical variable:
Removing: If you establish that the outliers are due to errors, or are so extreme that they would distort the analysis, you might consider removing them.

Transforming: Applying transformations (log, square root, etc.) can reduce the effect of outliers.

Imputation: Replace outliers with more representative values, such as the median or mean of the dataset.

Capping: Outliers are set to a specific threshold value. For instance, values greater than the 95th percentile are set to the value at the 95th percentile.

Using Robust Methods: Some machine learning models are less sensitive to outliers. Tree-based methods such as Random Forests, or estimators designed to be robust to outliers such as the RANSAC regressor, can reduce the need to handle outliers explicitly.
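
The first four handling strategies can each be sketched as a one-line pandas operation. The series `fare` is hypothetical, and using the IQR rule to decide what counts as an outlier is an assumption made for this example; any other detection rule could be substituted.

```python
import numpy as np
import pandas as pd

# Hypothetical fares: 8.0, 8.5, ..., 17.0 plus one extreme value.
fare = pd.Series([8.0 + 0.5 * i for i in range(19)] + [500.0], name="Fare")

# Flag high outliers with the IQR rule.
q1, q3 = fare.quantile(0.25), fare.quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)
is_outlier = fare > upper

removed = fare[~is_outlier]                      # Removing: drop the flagged rows
logged = np.log1p(fare)                          # Transforming: log(1 + x) compresses large values
imputed = fare.mask(is_outlier, fare.median())   # Imputation: replace outliers with the median
cap = fare.quantile(0.95)
capped = fare.clip(upper=cap)                    # Capping: limit values to the 95th percentile

print(len(removed), imputed.max(), capped.max())
```

Removing changes the number of rows, while the other three keep the row count and only change the values, which matters when the column has to stay aligned with the rest of the DataFrame.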