###### 1. Identifying missing values:
-To inspect the data we can use the info() method together with isnull() (or its alias isna()); the duplicated() method flags repeated rows:

-info(): This method provides a concise summary of the DataFrame, including the count of non-null values for each column. Columns with missing values will have a count less than the total number of rows.

-isnull() or isna() methods: These methods return a DataFrame of the same shape as the original, where each cell is True if the corresponding element is NaN (missing), and False otherwise.

In [11]:
import pandas as pd 

df = pd.read_csv('data.csv')

# info() prints the summary itself and returns None, so print() adds a trailing "None"
print(df.info()) # Summary of non-null values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 169 entries, 0 to 168
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  169 non-null    int64  
 1   Pulse     169 non-null    int64  
 2   Maxpulse  169 non-null    int64  
 3   Calories  164 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.4 KB
None


In [12]:
import pandas as pd 

df=pd.read_csv('data.csv')
print(df.isnull())  # Boolean mask indicating missing values

     Duration  Pulse  Maxpulse  Calories
0       False  False     False     False
1       False  False     False     False
2       False  False     False     False
3       False  False     False     False
4       False  False     False     False
..        ...    ...       ...       ...
164     False  False     False     False
165     False  False     False     False
166     False  False     False     False
167     False  False     False     False
168     False  False     False     False

[169 rows x 4 columns]
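
The printed mask truncates the middle rows, so missing cells are easy to overlook. A minimal sketch of counting them instead, assuming a small hypothetical DataFrame (`sample`) in place of data.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for data.csv
sample = pd.DataFrame({
    "Duration": [60, 45, 30, 60],
    "Calories": [409.1, np.nan, 195.1, np.nan],
})

# Number of missing values in each column
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = sample[sample.isnull().any(axis=1)]
print(rows_with_missing)
```

Summing the boolean mask works because True counts as 1 and False as 0.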


In [29]:
print(df.duplicated())

0      False
1      False
2      False
3      False
4      False
       ...  
164    False
165    False
166    False
167    False
168    False
Length: 169, dtype: bool


###### Handling Missing Values:
-You may decide to use the dropna() method to remove rows or columns containing missing values if the amount of missing values is minimal in relation to the dataset's size and has no apparent effect on the analysis.

-Replace missing values with an appropriate estimate. Typical approaches use the fillna() method to fill in missing data with the mean, median, mode, or a constant value.
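
As a rough sketch of the row-versus-column choice mentioned above, the snippet below applies dropna() in three ways. The DataFrame `sample` and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one NaN in Pulse and one in Calories
sample = pd.DataFrame({
    "Duration": [60, 45, 30, 60],
    "Pulse": [110, 117, np.nan, 98],
    "Calories": [409.1, np.nan, 195.1, 269.0],
})

rows_dropped = sample.dropna()                        # drop rows with any NaN
cols_dropped = sample.dropna(axis=1)                  # drop columns with any NaN
subset_dropped = sample.dropna(subset=["Calories"])   # only look at Calories

print(rows_dropped.shape, cols_dropped.shape, subset_dropped.shape)
```

None of these calls modifies `sample`; each returns a new DataFrame.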

In [26]:
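# Note: this cell (In [26]) ran after In [25] had already filled the NaN values in place,
# so nothing is dropped here and all 169 rows remain.
print(df.dropna())  # Drop rows with any missing value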

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.0
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4

[169 rows x 4 columns]


In [25]:
df.fillna(50, inplace=True)  # Fill missing values with the constant 50 (modifies df in place)
print(df.to_string())

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.0
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112      50.0
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   

###### Imputation
Imputation substitutes estimated values for missing data. The replacement values are usually estimated with statistical techniques or derived from the observed portion of the data. Several factors make imputation helpful when handling missing data:
 
-Preservation of Data Integrity

-Maintaining Sample Size

-Improved Analysis Accuracy

-Handling Practical Constraints
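
A minimal sketch of mean versus median imputation, assuming a hypothetical Calories column (the values are invented so that the two estimates differ):

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries and one large value
sample = pd.DataFrame({"Calories": [200.0, np.nan, 250.0, 600.0, np.nan]})

# Mean imputation: pulled upward by the large value
mean_filled = sample["Calories"].fillna(sample["Calories"].mean())

# Median imputation: less affected by the large value
median_filled = sample["Calories"].fillna(sample["Calories"].median())

print(mean_filled.tolist())
print(median_filled.tolist())
```

Both statistics skip NaN by default, so they are computed from the observed values only.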

In [28]:
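# Note: In [25] already replaced the NaN values in place with 50, so df has no NaN left
# and row 17 still shows 50.0; reload data.csv before this cell to impute with the mean.
a = df.fillna(df.mean())  # Fill missing values with each column's mean (imputation)
print(a.to_string())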

     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
5          60    102       127     300.0
6          60    110       136     374.0
7          45    104       134     253.3
8          30    109       133     195.1
9          60     98       124     269.0
10         60    103       147     329.3
11         60    100       120     250.7
12         60    106       128     345.3
13         60    104       132     379.3
14         60     98       123     275.0
15         60     98       120     215.2
16         60    100       120     300.0
17         45     90       112      50.0
18         60    103       123     323.0
19         45     97       125     243.0
20         60    108       131     364.2
21         45    100       119     282.0
22         60    130       101     300.0
23         45   

###### 2. There are three methods of encoding:

Ordinal Encoding: This approach works well with categorical variables that have an inherent ordering. Based on their ordinal relationship, categories can be manually given numerical values.

One-Hot Encoding: When there is no ordinal link between the categories, one-hot encoding is employed. For every category, it generates binary columns that show whether the category is present or absent for every observation.

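Label Encoding: This technique assigns every category a distinct integer. It is compact when the categorical variable has many categories and one-hot encoding would produce a high-dimensional sparse matrix. However, the integers imply an order that may not exist, so it is best suited to target labels or to models that do not treat the codes as magnitudes.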


In [31]:
# Sample of data
df = pd.DataFrame({"Score": ["Low", "Low", "Medium", "Medium", "High", "Low", "Medium","High", "Low"]})
print(df)

    Score
0     Low
1     Low
2  Medium
3  Medium
4    High
5     Low
6  Medium
7    High
8     Low


In [32]:
scale_mapper = {"Low":1, "Medium":2, "High":3}
df["Scale"] = df["Score"].map(scale_mapper)  # map() looks up each category's rank
print(df) # Ordinal encoding 

    Score  Scale
0     Low      1
1     Low      1
2  Medium      2
3  Medium      2
4    High      3
5     Low      1
6  Medium      2
7    High      3
8     Low      1


In [4]:
import pandas as pd
data = {'Category': ['A', 'B', 'A', 'C', 'B']}
df = pd.DataFrame(data)

df_encoded = pd.get_dummies(df, columns=['Category'])
print(df_encoded) # One-hot encoding

   Category_A  Category_B  Category_C
0        True       False       False
1       False        True       False
2        True       False       False
3       False       False        True
4       False        True       False


In [5]:
import pandas as pd
from sklearn import preprocessing

my_data = {
    "Gender": ['F', 'M', 'F', 'M', 'F', 'M'],
    "Name": ['Cindy', 'Johnny', 'Sara', 'Victor', 'Martha', 'Max']
}
blk = pd.DataFrame(my_data)
print(blk)  # Before label encoding

  Gender    Name
0      F   Cindy
1      M  Johnny
2      F    Sara
3      M  Victor
4      F  Martha
5      M     Max


In [6]:
my_label = preprocessing.LabelEncoder()

# fit_transform() learns the sorted categories (F, M) and replaces them with 0 and 1
blk['Gender'] = my_label.fit_transform(blk['Gender'])
print(blk['Gender'].unique())
print(blk)  # After label encoding

[0 1]
   Gender    Name
0       0   Cindy
1       1  Johnny
2       0    Sara
3       1  Victor
4       0  Martha
5       1     Max


- One-hot encoding represents a categorical variable as numerical values that a machine learning model can use.
For a Gender column, one-hot encoding creates separate columns for the Male and Female labels: wherever the value is Male, the Male column holds 1 and the Female column holds 0, and vice versa.
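
The Male/Female case described above can be sketched with pd.get_dummies(). The DataFrame name `people` is an assumption; the values mirror the Gender/Name sample used for label encoding:

```python
import pandas as pd

people = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F", "M"],
    "Name": ["Cindy", "Johnny", "Sara", "Victor", "Martha", "Max"],
})

# dtype=int yields 1/0 instead of True/False in the indicator columns
encoded = pd.get_dummies(people, columns=["Gender"], dtype=int)
print(encoded)
```

The original Gender column is replaced by one indicator column per category (Gender_F and Gender_M).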

###### 3. Removing Duplicates Questions:

- We can identify duplicates with the duplicated() method.
- We can remove them directly with the drop_duplicates() method, which returns a DataFrame with duplicate rows removed. By default, it keeps the first occurrence of each duplicated row.

In [7]:
import pandas as pd

# Sample DataFrame with duplicate rows
data = {'A': [1, 2, 3, 4, 2],
        'B': ['a', 'b', 'c', 'd', 'b'],
        'C': ['x', 'y', 'z', 'x', 'y']}
df = pd.DataFrame(data)

# Identifying duplicate rows
duplicate_rows = df[df.duplicated()]
print("Duplicate Rows:")
print(duplicate_rows)

# Removing duplicate rows
df_no_duplicates = df.drop_duplicates()
print("\nDataFrame without Duplicates:")
print(df_no_duplicates)

Duplicate Rows:
   A  B  C
4  2  b  y

DataFrame without Duplicates:
   A  B  C
0  1  a  x
1  2  b  y
2  3  c  z
3  4  d  x


-The difference between duplicated() and drop_duplicates():

duplicated() identifies duplicate rows. It returns a boolean Series in which True marks a row that repeats an earlier one; by default the first occurrence is marked False.

drop_duplicates() removes duplicate rows. It returns a new DataFrame with duplicate rows removed and leaves the original unchanged.
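
Both methods accept `keep` and `subset` parameters that change which rows count as duplicates. A short sketch on the same sample data as above:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 2],
                   "B": ["a", "b", "c", "d", "b"],
                   "C": ["x", "y", "z", "x", "y"]})

# keep=False marks every member of a duplicate group, not just the repeats
mask_all = df.duplicated(keep=False)

# keep="last" retains the last occurrence instead of the first
last_kept = df.drop_duplicates(keep="last")

# subset restricts the comparison to the listed columns
unique_c = df.drop_duplicates(subset=["C"])

print(mask_all.tolist())
print(last_kept.index.tolist())
print(unique_c.index.tolist())
```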

###### 4. Data Scaling and Normalization Questions:

Machine learning models can perform much better when features are scaled. Many algorithms are sensitive to the magnitude of their inputs, so once the features share a comparable scale, no single feature dominates and the search for the best solution becomes easier.

-It facilitates faster algorithm convergence. This is particularly true for gradient descent-based algorithms, as they may optimize the cost function more quickly.

-Feature scaling also helps with interpretability since it brings the magnitudes of all features onto the same scale.

-Feature scaling can also increase robustness to outliers in some cases; however, this does not mean that feature scaling should be used as a strategy for dealing with outliers.

###### Min-Max
-One of the most used techniques for normalizing data is min-max normalization. All features have their minimum value converted to a zero, their maximum value converted to a one, and all other values converted to a decimal between 0 and 1.

-This transformation linearly scales each feature, preserving the relative differences between values.

-Min-max scaling is sensitive to outliers since it scales the data based on the range of values, which can be influenced by outliers.
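
A minimal sketch of min-max normalization in plain pandas, using made-up Duration and Pulse values. It assumes no column is constant, since a zero range would cause a division by zero:

```python
import pandas as pd

# Hypothetical features on different scales
df = pd.DataFrame({"Duration": [30, 45, 60, 75],
                   "Pulse": [90, 100, 110, 130]})

# x_scaled = (x - min) / (max - min), applied column by column
scaled = (df - df.min()) / (df.max() - df.min())
print(scaled)
```

scikit-learn's MinMaxScaler applies the same formula and remembers the fitted minimum and maximum, which helps when new data must be scaled identically.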

###### Z-score

-Z-score normalization (standardization) is a normalization strategy that reduces this outlier issue.

-This transformation centers the data around the mean and scales it by the standard deviation.

-Z-score normalization is less sensitive to outliers than min-max scaling: the mean and standard deviation are still affected by outliers, but less so than the minimum and maximum, which are determined entirely by the most extreme values.
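
A minimal sketch of z-score normalization in plain pandas, again with made-up values. The choice of `ddof=0` (population standard deviation) is an assumption made here to match what scikit-learn's StandardScaler computes; pandas defaults to `ddof=1`:

```python
import pandas as pd

# Hypothetical features on different scales
df = pd.DataFrame({"Duration": [30, 45, 60, 75],
                   "Pulse": [90, 100, 110, 130]})

# z = (x - mean) / std, applied column by column
standardized = (df - df.mean()) / df.std(ddof=0)
print(standardized)
```

After the transformation each column has mean 0 and standard deviation 1, but the values are not confined to a fixed range as they are with min-max scaling.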

###### Outliers

An outlier is a stray data point that differs greatly from the majority of the data, and it can affect the performance of machine learning models in several ways.

-1. Model Performance:
Outliers can distort the patterns in the data, so a model fitted to them may produce inaccurate predictions.

-2. Overfitting:
Because outliers often reflect noise rather than actual patterns, they can cause overfitting in complex models. When trained on datasets that include outliers, a model may fit the noise, leading to poor generalization on unseen data.

-3. Robustness:
On the positive side, some machine learning algorithms are less vulnerable to outliers. Decision trees and random forests, for example, tolerate outliers well and continue to perform well in their presence.


-5. Preprocessing Techniques:
Robust preprocessing procedures are essential to reduce the influence of outliers. These include using robust estimators to impute missing values, transforming skewed features to lessen the impact of outliers, and capping outliers to a predetermined range.



###### Detecting outliers:

- Z-score:
z = (x - mean) / std
-x: the data point
-mean: the mean of the data set
-std: the standard deviation of the data set

To identify outliers using the z-score, we can set a threshold value, say 3. Any data point with a z-score greater than 3 or less than -3 can be considered an outlier. We can use the scipy library in Python to calculate the z-score and identify outliers.
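
A rough sketch of the scipy approach, using `scipy.stats.zscore` on a made-up array of ages. The threshold of 2 is an assumption chosen for this tiny sample: with only 9 points no z-score reaches 3, so the usual threshold would flag nothing here.

```python
import numpy as np
from scipy import stats

# Hypothetical ages with one obviously large value
ages = np.array([17, 16, 44, 16, 15, 18, 16, 17, 18])

# zscore() uses the population standard deviation (ddof=0) by default
z = stats.zscore(ages)

threshold = 2  # assumed; 3 is the more common choice on larger samples
outlier_values = ages[np.abs(z) > threshold]
print(outlier_values)
```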

- IQR:
The interquartile range (IQR) is a measure of the spread of the middle 50% of the data. It is calculated as the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the dataset. Any data point below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR can be considered an outlier.

In [28]:
import pandas as pd
import numpy as np
 
# Build a small sample DataFrame of ages
df = pd.DataFrame({"Age": [17, 16, 44, 16, 15, 18, 16, 17, 18]})

 
# Extract the column of interest for outlier detection (e.g., age)
column_name = "Age"
data = df[column_name]
 
# Calculate the Z-scores
z_scores = (data - data.mean()) / data.std()
 
# Define a threshold for identifying outliers (2 here rather than 3: with only 9 points the largest |z| is about 2.65)
threshold = 2
 
# Identify the outliers
outliers = df[abs(z_scores) > threshold]
 
# Print the outliers
print("Outliers:")
print(outliers)

Outliers:
   Age
2   44


In [30]:
# calculate IQR for column Age
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1

# identify outliers
threshold = 1.5
outliers = df[(df['Age'] < Q1 - threshold * IQR) | (df['Age'] > Q3 + threshold * IQR)]
print(outliers)

   Age
2   44


###### Handling Outliers


- Trimming:

Remove extreme values from the dataset based on a predefined threshold.
It involves discarding data points that exceed a certain percentile or fall outside a specified range.
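
A minimal sketch of trimming, assuming the 1.5 * IQR fences as the predefined range and the same sample ages used for detection:

```python
import pandas as pd

df = pd.DataFrame({"Age": [17, 16, 44, 16, 15, 18, 16, 17, 18]})

# Fences based on the interquartile range
q1, q3 = df["Age"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the rows whose Age lies inside the fences (inclusive)
trimmed = df[df["Age"].between(lower, upper)]
print(trimmed)
```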

- Winsorization:

Winsorization replaces extreme values with the values of the nearest non-outlier data points.
It helps reduce the impact of outliers without removing them entirely.
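
One way to sketch winsorization in pandas is to clip the values at chosen percentiles. The 5th and 95th percentiles are an assumption, and the `interpolation` arguments make each limit an actual data point rather than an interpolated value:

```python
import pandas as pd

ages = pd.Series([17, 16, 44, 16, 15, 18, 16, 17, 18])

# Limits taken from observed values near the 5th and 95th percentiles
lower = ages.quantile(0.05, interpolation="higher")
upper = ages.quantile(0.95, interpolation="lower")

# Values beyond the limits are replaced by the limits; nothing is removed
winsorized = ages.clip(lower=lower, upper=upper)
print(winsorized.tolist())
```

Unlike trimming, the result keeps all 9 observations.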

- Robust Scaling:

Use robust scaling techniques such as RobustScaler from scikit-learn to scale the data while mitigating the influence of outliers.
Robust scaling scales the data based on robust statistics like median and interquartile range (IQR).
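
A short sketch of RobustScaler with its default settings, compared against the same calculation done by hand. The column name `Age_scaled` is made up for this example:

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

df = pd.DataFrame({"Age": [17, 16, 44, 16, 15, 18, 16, 17, 18]})

# By default RobustScaler subtracts the median and divides by the IQR
scaler = RobustScaler()
df["Age_scaled"] = scaler.fit_transform(df[["Age"]]).ravel()

# The same calculation by hand, for comparison
iqr = df["Age"].quantile(0.75) - df["Age"].quantile(0.25)
manual = (df["Age"] - df["Age"].median()) / iqr

print(df)
```

The outlier is still present after scaling, but it no longer determines the center or the scale used for the other points.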

- Imputation:

Replace outlier values with more representative values using imputation techniques such as median or mean imputation.
Imputation can help retain data integrity while mitigating the influence of outliers on the analysis.
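
A minimal sketch of median imputation for outliers, assuming the 1.5 * IQR fences decide what counts as an outlier and that the median is taken over the remaining values. The column name `Age_imputed` is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"Age": [17, 16, 44, 16, 15, 18, 16, 17, 18]})

# Flag values outside the 1.5 * IQR fences
q1, q3 = df["Age"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = ~df["Age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Median of the non-outlier values
inlier_median = df.loc[~is_outlier, "Age"].median()

# Keep inliers as they are; replace flagged values with the median
df["Age_imputed"] = df["Age"].where(~is_outlier, inlier_median)
print(df)
```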
