# Q. 2 In Data Science, the ability to preprocess raw data effectively is crucial for
accurate and insightful analysis. The table below consists of both numeric and
categorical features. The numeric features, denoted as NumericFeature1 and
NumericFeature2, contain missing values represented as NaN. The
CategoricalFeature comprises distinct categories represented by alphabetic
characters.
+-----------------+-----------------+--------------------+
| NumericFeature1 | NumericFeature2 | CategoricalFeature |
+-----------------+-----------------+--------------------+
| 1.0             | 7               | A                  |
| 2.0             | 8               | B                  |
| NaN             | 9               | NaN                |
| 4.0             | 10              | A                  |
| 5.0             | 11              | B                  |
| 6.0             | 50              | C                  |
+-----------------+-----------------+--------------------+
Your task is to create a robust data preprocessing pipeline using Python, capable of
handling missing values, standardizing numeric features, and detecting and
removing outliers, thus enhancing the overall quality and integrity of the data. The
pipeline should encompass a range of preprocessing techniques to ensure the
resulting data is of high quality and suitable for subsequent analysis.
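
As a starting point, the table from the question can be loaded into a DataFrame directly. This is a minimal sketch: the variable names `sample_df` and `missing_counts` are assumptions introduced for illustration, and the values are copied from the table above.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table in the question.
sample_df = pd.DataFrame({
    "NumericFeature1": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "NumericFeature2": [7, 8, 9, 10, 11, 50],
    "CategoricalFeature": ["A", "B", np.nan, "A", "B", "C"],
})

# Count missing values per column to see which columns need imputing.
missing_counts = sample_df.isna().sum()
print(sample_df)
print(missing_counts)
```

In this table only NumericFeature1 and CategoricalFeature have a missing entry (both in the third row), and the value 50 in NumericFeature2 stands well apart from the others.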

In [2]:
import numpy as np
import pandas as pd

In [30]:
df = pd.read_excel("C:\\Users\\gajendra singh\\OneDrive\\Desktop\\pandas\\Customer_Call.xlsx")


In [31]:
df

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact,Not_Useful_Column
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Yes,No,True
1,1002,Abed,Nadir,123/643/9775,93 West Main Street,No,Yes,False
2,1003,Walter,/White,7066950392,298 Drugs Driveway,N,,True
3,1004,Dwight,Schrute,123-543-2345,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y,True
4,1005,Jon,Snow,876|678|3469,123 Dragons Road,Y,No,True
5,1006,Ron,Swanson,304-762-2467,768 City Parkway,Yes,Yes,True
6,1007,Jeff,Winger,,1209 South Street,No,No,False
7,1008,Sherlock,Holmes,876|678|3469,98 Clue Drive,N,No,False
8,1009,Gandalf,,N/a,123 Middle Earth,Yes,,False
9,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Yes,No,True


In [32]:
df['CustomerID']

0     1001
1     1002
2     1003
3     1004
4     1005
5     1006
6     1007
7     1008
8     1009
9     1010
10    1011
11    1012
12    1013
13    1014
14    1015
15    1016
16    1017
17    1018
18    1019
19    1020
20    1020
Name: CustomerID, dtype: int64

In [33]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import IsolationForest

# Display the initial dataset
print("Initial Dataset:")
print(df)

Initial Dataset:
    CustomerID First_Name    Last_Name  Phone_Number  \
0         1001      Frodo      Baggins  123-545-5421   
1         1002       Abed        Nadir  123/643/9775   
2         1003     Walter       /White    7066950392   
3         1004     Dwight      Schrute  123-543-2345   
4         1005        Jon         Snow  876|678|3469   
5         1006        Ron      Swanson  304-762-2467   
6         1007       Jeff       Winger           NaN   
7         1008   Sherlock       Holmes  876|678|3469   
8         1009    Gandalf          NaN           N/a   
9         1010      Peter       Parker  123-545-5421   
10        1011    Samwise       Gamgee           NaN   
11        1012      Harry    ...Potter    7066950392   
12        1013        Don       Draper  123-543-2345   
13        1014     Leslie        Knope  876|678|3469   
14        1015       Toby  Flenderson_  304-762-2467   
15        1016        Ron      Weasley  123-545-5421   
16        1017   Michael       

In [34]:
# Drop columns that are not useful
df = df.drop(columns=['Do_Not_Contact', 'Not_Useful_Column'])

# Handle missing values
# For numeric columns, replace missing values with the mean
# For categorical columns, replace missing values with the most frequent value
numeric_cols = df.select_dtypes(include=['number']).columns
categorical_cols = df.select_dtypes(include=['object']).columns

numeric_imputer = SimpleImputer(strategy='mean')
categorical_imputer = SimpleImputer(strategy='most_frequent')

df[numeric_cols] = numeric_imputer.fit_transform(df[numeric_cols])
df[categorical_cols] = categorical_imputer.fit_transform(df[categorical_cols])
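
To see what the two imputers above would do on the small table from the question, here is a hedged sketch. The name `sample_df` and the hard-coded column lists are assumptions for illustration; they are not part of the original notebook. The categorical column is kept as object dtype to match the `select_dtypes(include=['object'])` call above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Sample data copied from the table in the question.
sample_df = pd.DataFrame({
    "NumericFeature1": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "NumericFeature2": [7, 8, 9, 10, 11, 50],
    "CategoricalFeature": pd.Series(["A", "B", np.nan, "A", "B", "C"], dtype=object),
})

numeric_cols = ["NumericFeature1", "NumericFeature2"]
categorical_cols = ["CategoricalFeature"]

# Numeric gaps are filled with the column mean.
sample_df[numeric_cols] = SimpleImputer(strategy="mean").fit_transform(
    sample_df[numeric_cols]
)
# Categorical gaps are filled with the most frequent category.
sample_df[categorical_cols] = SimpleImputer(strategy="most_frequent").fit_transform(
    sample_df[categorical_cols]
)

print(sample_df)
```

The missing NumericFeature1 value becomes 3.6, the mean of 1, 2, 4, 5, and 6. In CategoricalFeature, A and B are tied with two occurrences each; `SimpleImputer` resolves such ties by returning the smallest value, so the gap is filled with A.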


In [35]:
# Standardize numeric features
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# Detect outliers using Isolation Forest (they are removed in the next cell)
# fit_predict returns 1 for inliers and -1 for outliers
outlier_detector = IsolationForest(contamination=0.05)  # Adjust contamination based on your dataset
df['Outlier'] = outlier_detector.fit_predict(df[numeric_cols])
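
Isolation Forest is randomized and its `contamination` setting is hard to tune on a table as small as the one in the question. As an alternative sketch (a different technique from the one used above, not a replacement for it), Tukey's 1.5 × IQR rule gives a deterministic outlier check. The names `numeric`, `scaled`, `is_outlier`, and `cleaned` are assumptions for illustration, and the NaN in NumericFeature1 is assumed to have already been replaced by the column mean, 3.6.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Numeric columns from the question's table, with the single NaN already
# replaced by the column mean (3.6), as mean imputation would do.
numeric = pd.DataFrame({
    "NumericFeature1": [1.0, 2.0, 3.6, 4.0, 5.0, 6.0],
    "NumericFeature2": [7.0, 8.0, 9.0, 10.0, 11.0, 50.0],
})

# Standardize: each column ends up with mean 0 and unit (population) variance.
scaled = pd.DataFrame(
    StandardScaler().fit_transform(numeric),
    columns=numeric.columns,
    index=numeric.index,
)

# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles.
q1 = scaled.quantile(0.25)
q3 = scaled.quantile(0.75)
iqr = q3 - q1
is_outlier = ((scaled < q1 - 1.5 * iqr) | (scaled > q3 + 1.5 * iqr)).any(axis=1)

# Keep only the rows where no numeric column was flagged.
cleaned = scaled[~is_outlier]
print(is_outlier.tolist())  # [False, False, False, False, False, True]
```

Only the last row is flagged, because 50 in NumericFeature2 lies far above the upper fence. The rule is unaffected by standardization, since rescaling a column shifts and stretches its quartiles by the same amount.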


In [36]:
# Keep only non-outliers
df = df[df['Outlier'] == 1]

# Drop the 'Outlier' column as it is no longer needed
df = df.drop(columns=['Outlier'])

# Display the preprocessed dataset
print("\nPreprocessed Dataset:")
print(df)



Preprocessed Dataset:
    CustomerID First_Name    Last_Name  Phone_Number  \
1    -1.497070       Abed        Nadir  123/643/9775   
2    -1.329844     Walter       /White    7066950392   
3    -1.162618     Dwight      Schrute  123-543-2345   
4    -0.995392        Jon         Snow  876|678|3469   
5    -0.828166        Ron      Swanson  304-762-2467   
6    -0.660940       Jeff       Winger  876|678|3469   
7    -0.493714   Sherlock       Holmes  876|678|3469   
8    -0.326489    Gandalf    Skywalker           N/a   
9    -0.159263      Peter       Parker  123-545-5421   
10    0.007963    Samwise       Gamgee  876|678|3469   
11    0.175189      Harry    ...Potter    7066950392   
12    0.342415        Don       Draper  123-543-2345   
13    0.509641     Leslie        Knope  876|678|3469   
14    0.676867       Toby  Flenderson_  304-762-2467   
15    0.844092        Ron      Weasley  123-545-5421   
16    1.011318   Michael         Scott  123/643/9775   
17    1.178544      Clark