## For Numerical Data
    Handling missing values         :- Fill or remove null/NaN data.
    Normalization / Standardization :- Scale features to the same range.
    Encoding categorical variables  :- Convert classes to numbers (One-Hot, Label Encoding).
    Outlier detection               :- Remove extreme values that distort learning.
    Balancing the dataset           :- Fix class imbalance (oversampling/undersampling).
    Feature selection               :- Remove unnecessary or redundant features.
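
The outlier-detection step in the list above is not demonstrated later in this notebook, so here is a minimal sketch of one common approach, the interquartile range (IQR) rule. The salary values and the 1.5 multiplier are assumptions chosen for illustration; they are not taken from the dataset used below.

```python
import pandas as pd

# Hypothetical salaries; the last value is an obvious outlier
salaries = pd.Series([48000, 52000, 54000, 58000, 61000, 67000, 72000, 500000])

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (salaries < lower) | (salaries > upper)
cleaned = salaries[~is_outlier]  # keep only the values inside the fences

print("Flagged as outliers:")
print(salaries[is_outlier])
```

Whether a flagged value should actually be removed depends on the data; an extreme value can also be a legitimate observation.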

In [None]:
Note: This dataset is not ideal for real-world use; it is intended only for learning and understanding the concepts.

In [121]:
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Dataset

In [122]:
df= pd.read_csv("DataSet/customer_purchase_preprocessing.csv")  # update the file path if your CSV is stored elsewhere

In [123]:
df.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


In [124]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    10 non-null     object 
 1   Age        9 non-null      float64
 2   Salary     9 non-null      float64
 3   Purchased  10 non-null     object 
dtypes: float64(2), object(2)
memory usage: 452.0+ bytes


In [125]:
df.duplicated().sum()

np.int64(0)

In [126]:
df.isna().sum()

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

# Handling Null Values
# 1. Remove Missing Values
# 2. Imputation (Fill Missing Values)

In [None]:
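# 1. Remove Missing Values
# Each call below returns a new DataFrame and leaves df unchanged,
# so the imputation cells further down still see the original data.

# Drop every row that contains at least one null value
df_rows_dropped = df.dropna()
# In-place alternative (modifies df itself):
# df.dropna(inplace=True)

# If a column has too many null values, dropping the entire column can be the better choice.
# Note: dropna(axis=1) drops every column that contains a null, so here it removes both Age and Salary.
df_cols_dropped = df.dropna(axis=1)

# To drop one specific column instead:
df_without_age = df.drop('Age', axis=1)
# To drop multiple columns:
df_without_age_salary = df.drop(['Age', 'Salary'], axis=1)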
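
For the "too many null values" case mentioned in the comments above, one option is the `thresh` argument of `dropna`, which sets the minimum number of non-null values a column must have to be kept. The sketch below uses a made-up frame and an assumed 50% cutoff; both are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'Mostly_Empty' is 75% null, 'Age' has a single null
toy = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Mostly_Empty": [np.nan, np.nan, np.nan, 1.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0],
})

# Keep a column only if at least half of its values are present
min_non_null = int(np.ceil(0.5 * len(toy)))
kept = toy.dropna(axis=1, thresh=min_non_null)

print(kept.columns.tolist())
```

This keeps `Age` (one missing value, which can be imputed later) while removing only the column that is mostly empty.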


# 2.Imputation (Fill Missing Values)
    1. Mean   :- When values are roughly normally distributed    df['Age'].fillna(df['Age'].mean())
    2. Median :- When outliers exist (more robust)               df['Salary'].fillna(df['Salary'].median())
    3. Mode   :- When the column has few unique values           df['Age'].fillna(df['Age'].mode()[0])
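
To see why the median is the safer choice when outliers exist, the sketch below fills the same missing value with both statistics. The salary values are hypothetical and include one deliberately extreme entry.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries with one missing value and one extreme value
salary = pd.Series([48000.0, 52000.0, 54000.0, np.nan, 61000.0, 900000.0])

mean_fill = salary.fillna(salary.mean())      # fill value is pulled upward by the outlier
median_fill = salary.fillna(salary.median())  # fill value stays near the typical salaries

print("Filled with mean:  ", mean_fill[3])
print("Filled with median:", median_fill[3])
```

The mean-based fill lands far above every ordinary salary in the column, while the median-based fill is one of the typical values.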

In [128]:
df["Age"]=df["Age"].fillna(df["Age"].mean())

In [129]:
df["Salary"]=df['Salary'].fillna(df["Salary"].median())

## Splitting the Dependent and Independent Variables

In [130]:
# Create the independent variable matrix (every column except the last)
X=df.iloc[:,:-1]
X

Unnamed: 0,Country,Age,Salary
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0
3,Spain,38.0,61000.0
4,Germany,40.0,61000.0
5,France,35.0,58000.0
6,Spain,38.777778,52000.0
7,France,48.0,79000.0
8,Germany,50.0,83000.0
9,France,37.0,67000.0


In [131]:
# Create the dependent variable vector (the last column)
Y=df.iloc[:,-1]
Y

0     No
1    Yes
2     No
3     No
4    Yes
5    Yes
6     No
7    Yes
8     No
9    Yes
Name: Purchased, dtype: object

# Encoding the categorical data

In [132]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [133]:
label_encoder = LabelEncoder()
Y = label_encoder.fit_transform(Y)

In [134]:
# One-Hot Encode the Country column
X= pd.get_dummies(X, columns=['Country'])

In [135]:
X   # Notice that One-Hot Encoding has been applied here:
    # the single Country column is converted into multiple columns, one per class in the original column (France, Germany, Spain)

Unnamed: 0,Age,Salary,Country_France,Country_Germany,Country_Spain
0,44.0,72000.0,True,False,False
1,27.0,48000.0,False,False,True
2,30.0,54000.0,False,True,False
3,38.0,61000.0,False,False,True
4,40.0,61000.0,False,True,False
5,35.0,58000.0,True,False,False
6,38.777778,52000.0,False,False,True
7,48.0,79000.0,True,False,False
8,50.0,83000.0,False,True,False
9,37.0,67000.0,True,False,False


In [136]:
Y  # 0 = No and 1 = Yes

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
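
The 0 = No, 1 = Yes mapping does not have to be memorized: a fitted `LabelEncoder` stores its classes in sorted order, and the position of each class is its code. A minimal sketch on a hypothetical label list:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["No", "Yes", "No", "Yes"]  # hypothetical target values

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)
decoded = encoder.inverse_transform(encoded)

print(encoder.classes_)  # classes in sorted order; the index of each class is its code
print(encoded)           # integer codes
print(decoded)           # codes mapped back to the original strings
```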

# Splitting the Train and Test Data

In [137]:
from sklearn.model_selection import train_test_split

In [138]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
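
With a small or imbalanced dataset, a plain random split can leave one class missing from the test set. One option, shown here as a sketch on made-up arrays rather than on `X` and `Y`, is the `stratify` argument, which keeps the class proportions the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features, 5 samples per class
features = np.arange(20).reshape(10, 2)
target = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# stratify=target preserves the 50/50 class ratio in train and test
f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.2, random_state=0, stratify=target
)

print("Train shape:", f_train.shape, "Test shape:", f_test.shape)
print("Train class counts:", np.bincount(t_train))
print("Test class counts: ", np.bincount(t_test))
```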

# Feature Scaling
## Apply feature scaling after the train-test split, so the scaler is fitted on the training data only and no test-set information leaks into it

In [143]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Select columns (edit these column names to match yours)
numeric_features = ['Age', 'Salary']
categorical_features = ['Country_France', 'Country_Germany', 'Country_Spain']  # one-hot encoded columns, passed through unscaled

# Apply scaling only to numeric columns
ct = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features)
    ],
    remainder='passthrough'  # keep categorical columns as they are
)

X_train = ct.fit_transform(X_train)
X_test = ct.transform(X_test)
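
What `StandardScaler` computes is the z-score, z = (x - mean) / std, where the mean and standard deviation come from the training data only. The sketch below checks this by hand on a few hypothetical ages; the arrays are illustrative and separate from `X_train` and `X_test`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
train = np.array([[27.0], [30.0], [38.0], [44.0]])
test = np.array([[35.0]])

scaler = StandardScaler().fit(train)  # learns mean and std from the training data only

# Manual z-score using the training mean and population std (ddof=0)
mu = train.mean(axis=0)
sigma = train.std(axis=0)
manual_test = (test - mu) / sigma

print("Scaler output:", scaler.transform(test))
print("Manual output:", manual_test)
print("Match:", np.allclose(scaler.transform(test), manual_test))
```

The test value is scaled with the training statistics, which is why the code above calls `fit_transform` on the training split and only `transform` on the test split.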


# These are the fundamental preprocessing steps. Focus on understanding the theory behind each heading. Once you grasp these concepts, you’ll have a much better understanding of data preparation for machine learning.

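# Ignore the result: the dataset is too small to learn anything meaningful from. Note that Purchased is a binary target, so a classifier such as LogisticRegression would normally be used; LinearRegression here only demonstrates the fit/predict workflow.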

In [144]:
from sklearn.linear_model import LinearRegression

In [145]:
model = LinearRegression()
model.fit(X_train, y_train)

In [146]:
# Make predictions on the test set
y_pred = model.predict(X_test)

In [147]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# y_pred was already computed in the previous cell

print("R2 Score:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("MAE:", mean_absolute_error(y_test, y_pred))


R2 Score: 0.0
MSE: 1.7125528916668489
RMSE: 1.3086454415413096
MAE: 1.2383291022214442
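
To make the four metrics less of a black box, the sketch below computes each one by hand with NumPy and compares it with the scikit-learn function. The true and predicted values are hypothetical and unrelated to the model output above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.5])

errors = y_true - y_hat
mse = np.mean(errors ** 2)        # mean of squared errors
rmse = np.sqrt(mse)               # same units as the target
mae = np.mean(np.abs(errors))     # mean of absolute errors
# R2 = 1 - (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print("MSE :", mse, "| sklearn:", mean_squared_error(y_true, y_hat))
print("RMSE:", rmse)
print("MAE :", mae, "| sklearn:", mean_absolute_error(y_true, y_hat))
print("R2  :", r2, "| sklearn:", r2_score(y_true, y_hat))
```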
