<a href="https://colab.research.google.com/github/Ma7moudYasser/MachineLearningForEru/blob/main/preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Preprocessing


### Data cleaning

In [3]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Sample data with missing values and outliers
data = {
    'A': [1, 2, np.nan, 4, 100],  # 100 is an outlier
    'B': [5, np.nan, 7, 8, 9],
    'C': [10, 11, 12, np.nan, 14]
}

df = pd.DataFrame(data)
print("Original data:")
print(df)

# Impute missing values with the column mean
# (note: the outlier 100 in column A inflates that column's mean)
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print("\nData after imputing missing values:")
print(df_imputed)

# Handling outliers (using IQR method)
Q1 = df_imputed['A'].quantile(0.25)
Q3 = df_imputed['A'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print("\nOutlier bounds:")
print(lower_bound)
print(upper_bound)

df_clean = df_imputed[(df_imputed['A'] >= lower_bound) & (df_imputed['A'] <= upper_bound)]

print("\nData after removing outliers:")
print(df_clean)

# Remove duplicates
df_clean = df_clean.drop_duplicates()

print("\nFinal cleaned data:")
print(df_clean)

Original data:
       A    B     C
0    1.0  5.0  10.0
1    2.0  NaN  11.0
2    NaN  7.0  12.0
3    4.0  8.0   NaN
4  100.0  9.0  14.0

Data after imputing missing values:
        A     B      C
0    1.00  5.00  10.00
1    2.00  7.25  11.00
2   26.75  7.00  12.00
3    4.00  8.00  11.75
4  100.00  9.00  14.00

Outlier bounds:
-35.125
63.875

Data after removing outliers:
       A     B      C
0   1.00  5.00  10.00
1   2.00  7.25  11.00
2  26.75  7.00  12.00
3   4.00  8.00  11.75

Final cleaned data:
       A     B      C
0   1.00  5.00  10.00
1   2.00  7.25  11.00
2  26.75  7.00  12.00
3   4.00  8.00  11.75
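
In the run above, the missing value in column A was filled with 26.75 because the mean is pulled upward by the outlier 100. As a minimal sketch, assuming the same sample data, the comparison below shows how a median strategy behaves on that column. The names `df_mean` and `df_median` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same sample data as the cell above
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 100],  # 100 is an outlier
    'B': [5, np.nan, 7, 8, 9],
    'C': [10, 11, 12, np.nan, 14],
})

# Fit one imputer per strategy so the results can be compared side by side
mean_imputer = SimpleImputer(strategy='mean')
median_imputer = SimpleImputer(strategy='median')

df_mean = pd.DataFrame(mean_imputer.fit_transform(df), columns=df.columns)
df_median = pd.DataFrame(median_imputer.fit_transform(df), columns=df.columns)

# Row 2 is the row where column A was missing
print("Imputed A (mean):  ", df_mean.loc[2, 'A'])    # 26.75
print("Imputed A (median):", df_median.loc[2, 'A'])  # 3.0
```

The median of the observed values 1, 2, 4, 100 is 3.0, which is much closer to the bulk of the data. Whether that is preferable depends on the dataset; another option is to detect outliers before imputing.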


### Data integration

In [4]:
import pandas as pd

# Sample datasets
data1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['John', 'Jane', 'Bob'],
    'Age': [25, 30, 35]
})

data2 = pd.DataFrame({
    'ID': [2, 3, 4],
    'City': ['New York', 'London', 'Paris'],
    'Salary': [50000, 60000, 55000]
})

print("Dataset 1:")
print(data1)
print("\nDataset 2:")
print(data2)

# Merge datasets based on 'ID'
merged_data = pd.merge(data1, data2, on='ID', how='outer')

print("\nMerged Dataset:")
print(merged_data)

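# Fill missing values by assigning the result back to the column
# (calling fillna(..., inplace=True) on a selected column acts on a copy)
merged_data['City'] = merged_data['City'].fillna('Unknown')
merged_data['Salary'] = merged_data['Salary'].fillna(merged_data['Salary'].mean())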

print("\nFinal Integrated Dataset:")
print(merged_data)

Dataset 1:
   ID  Name  Age
0   1  John   25
1   2  Jane   30
2   3   Bob   35

Dataset 2:
   ID      City  Salary
0   2  New York   50000
1   3    London   60000
2   4     Paris   55000

Merged Dataset:
   ID  Name   Age      City   Salary
0   1  John  25.0       NaN      NaN
1   2  Jane  30.0  New York  50000.0
2   3   Bob  35.0    London  60000.0
3   4   NaN   NaN     Paris  55000.0

Final Integrated Dataset:
   ID  Name   Age      City   Salary
0   1  John  25.0   Unknown  55000.0
1   2  Jane  30.0  New York  50000.0
2   3   Bob  35.0    London  60000.0
3   4   NaN   NaN     Paris  55000.0


Note: calling 'df[col].fillna(value, inplace=True)' on a selected column emits a FutureWarning. The selected column behaves as a copy, so from pandas 3.0 the original DataFrame is left unchanged. To fill values on the original object, use 'df[col] = df[col].fillna(value)' or 'df.fillna({col: value}, inplace=True)' instead.
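
After an outer merge it can help to know which source each row came from before filling the gaps. The following is a minimal sketch, assuming the same two sample frames as above. `indicator=True` is a standard `pd.merge` option that adds a `_merge` column; the variable name `merged_tagged` is only illustrative.

```python
import pandas as pd

# Same sample datasets as the cell above
data1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['John', 'Jane', 'Bob'],
    'Age': [25, 30, 35],
})
data2 = pd.DataFrame({
    'ID': [2, 3, 4],
    'City': ['New York', 'London', 'Paris'],
    'Salary': [50000, 60000, 55000],
})

# indicator=True adds a '_merge' column with the values
# 'left_only', 'right_only', or 'both' for each row
merged_tagged = pd.merge(data1, data2, on='ID', how='outer', indicator=True)
print(merged_tagged[['ID', '_merge']])

# Rows that exist in only one of the two datasets
unmatched = merged_tagged[merged_tagged['_merge'] != 'both']
print("\nIDs without a match:", sorted(unmatched['ID'].tolist()))
```

Here ID 1 appears only in the first dataset and ID 4 only in the second, which is why those rows carry missing values after the merge.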


### Data transformation

In [5]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Sample data
data = {
    'Age': [25, 30, 35, 40, 45],
    'Salary': [50000, 60000, 75000, 90000, 100000],
    'Experience': [1, 3, 5, 7, 10]
}

df = pd.DataFrame(data)
print("Original data:")
print(df)

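# Standardization (z-score): subtract the column mean, then divide by the
# population standard deviation (ddof=0), which is what StandardScaler uses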
scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print("\nStandardized data:")
print(df_standardized)

# Min-Max normalization
min_max_scaler = MinMaxScaler()
df_normalized = pd.DataFrame(min_max_scaler.fit_transform(df), columns=df.columns)

print("\nNormalized data:")
print(df_normalized)

# Log transformation (for skewed data)
df['Salary_Log'] = np.log(df['Salary'])

print("\nData with log-transformed Salary:")
print(df)

# Discretization (binning)
# Bins are right-inclusive by default: (0, 30], (30, 40], (40, 50]
df['Age_Group'] = pd.cut(
    df['Age'],
    bins=[0, 30, 40, 50],
    labels=['Young', 'Middle', 'Senior']
)

print("\nData with discretized Age:")
print(df)

Original data:
   Age  Salary  Experience
0   25   50000           1
1   30   60000           3
2   35   75000           5
3   40   90000           7
4   45  100000          10

Standardized data:
        Age    Salary  Experience
0 -1.414214 -1.355815   -1.344387
1 -0.707107 -0.813489   -0.704203
2  0.000000  0.000000   -0.064018
3  0.707107  0.813489    0.576166
4  1.414214  1.355815    1.536443

Normalized data:
    Age  Salary  Experience
0  0.00     0.0    0.000000
1  0.25     0.2    0.222222
2  0.50     0.5    0.444444
3  0.75     0.8    0.666667
4  1.00     1.0    1.000000

Data with log-transformed Salary:
   Age  Salary  Experience  Salary_Log
0   25   50000           1   10.819778
1   30   60000           3   11.002100
2   35   75000           5   11.225243
3   40   90000           7   11.407565
4   45  100000          10   11.512925

Data with discretized Age:
   Age  Salary  Experience  Salary_Log Age_Group
0   25   50000           1   10.819778     Young
1   30   60000    