"""

@Author: Naveen Madev Naik</br>
@Date: 2024-10-08</br>
@Last Modified by: Naveen Madev Naik</br>
@Last Modified time: 2024-10-08</br>
@Title: Identifying the different machine learning types and applying data preprocessing steps such as handling missing data, handling categorical data,
        and feature scaling.</br>

"""

## ML Types

Classify each problem as supervised, unsupervised, reinforcement, or semi-supervised learning:</br>
a. Spam filtering: Is an email spam or not</br>
b. Given a list of customers and information about them, discover groups of similar users. This knowledge can then be used for targeted marketing</br>
c. Robotics: A robot is in a maze, and it needs to find a way out.</br>
d. Training an AI for a complex game such as Civilization or Dota</br>
e. Anomaly detection: Given measurements from sensors in a manufacturing facility, identify anomalies, i.e. that something is wrong</br>
f. Discover patterns in data such as whenever it rains, people tend to stay indoors. When it is hot, people buy more ice-cream.</br>
g. Given information about a house, predict its price</br>
h. Netflix: Given a user and a movie, predict the rating the user is going to give to the movie</br>
i. Given an image, output which objects are present in the image</br>

### a. Spam filtering: Is an email spam or not

Supervised learning: The model is trained on labeled emails (spam or not spam) and makes predictions for new emails.

### b. Given a list of customers and information about them, discover groups of similar users

Unsupervised learning: This is a clustering problem where the goal is to find groups (clusters) in the data without predefined labels.
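As a minimal sketch of what "no predefined labels" means in practice, the example below clusters a handful of made-up customers with k-means. The feature columns (age, annual spend), the values, and the choice of two clusters are all assumptions for illustration and do not come from the exercise.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [age, annual spend]; the values are made up.
customers = np.array([
    [22, 300], [25, 350], [24, 320],      # younger, lower spend
    [52, 2100], [55, 2300], [50, 2000],   # older, higher spend
])

# No labels are passed to the model: rows are grouped by similarity alone.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

print(labels)  # one cluster id per customer
```

Which group receives id 0 and which receives id 1 is arbitrary; only the grouping itself is meaningful.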

### c. Robotics: A robot is in a maze, and it needs to find a way out.

Reinforcement learning: The robot learns from interactions with its environment, receiving rewards or penalties for actions as it navigates the maze.

### d. Training an AI for a complex game such as Civilization or Dota

Reinforcement learning: The AI learns through trial and error, improving its strategies based on rewards or penalties from the game environment.

### e. Anomaly detection: Given measurements from sensors in a manufacturing facility, identify anomalies, i.e. that something is wrong

Unsupervised learning: Anomalies are identified without prior labels, based on unusual patterns or deviations from normal behavior.
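One simple way to flag "deviations from normal behavior" without labels is a robust z-score based on the median and the median absolute deviation (MAD). This is only a sketch: the sensor readings, the 0.6745 scaling constant, and the 3.5 cutoff are illustrative assumptions, not part of the exercise.

```python
import numpy as np

# Hypothetical temperature readings from one sensor; the values are made up.
readings = np.array([20.1, 19.8, 20.3, 20.0, 19.9, 35.0, 20.2, 20.1])

# Robust z-score: distance from the median, scaled by the MAD.
median = np.median(readings)
mad = np.median(np.abs(readings - median))
robust_z = 0.6745 * (readings - median) / mad

# Flag readings whose robust z-score exceeds a conventional cutoff of 3.5.
is_anomaly = np.abs(robust_z) > 3.5
print(np.where(is_anomaly)[0])  # indices of the flagged readings
```

The median and MAD are used here instead of the mean and standard deviation because a single extreme reading inflates the standard deviation enough to hide itself.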

### f. Discover patterns in data such as whenever it rains, people tend to stay indoors. When it is hot, people buy more ice-cream.

Unsupervised learning: This is pattern (association) discovery, where co-occurring events or correlations are found in the data without labeled outcomes.

### g. Given information about a house, predict its price

Supervised learning: The model is trained on labeled data (house features and their prices) to predict the price of new houses.
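A minimal sketch of this supervised setup is shown below, using a linear regression on made-up data. The single feature (floor area), the prices, and the choice of `LinearRegression` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: floor area in square metres and the known sale price.
area = np.array([[50.0], [80.0], [100.0], [120.0]])          # features
price = np.array([150000.0, 210000.0, 250000.0, 290000.0])   # labels

# The model learns the mapping from features to the labeled prices.
model = LinearRegression()
model.fit(area, price)

# Predict the price of a house that was not in the training data.
predicted = model.predict(np.array([[90.0]]))[0]
print(round(predicted))
```

The known prices are what make this supervised: they are the labels the model is fitted against.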

### h. Netflix: Given a user and a movie, predict the rating the user is going to give to the movie

Supervised learning: The ratings users have already given act as labels, and the model learns from this historical data to predict the rating for a new user-movie pair.

### i. Given an image, output which objects are present in the image

Supervised learning: The model is trained on labeled images (with objects identified) and classifies objects in new images.

## Data Preprocessing Steps

1. Apply the following steps to the dataset at the URL below:</br>
link: https://drive.google.com/open?id=1NKMy-zIT3tfpNLnA7G0EmPxgZe0OPXp_</br>
a. Handling missing data</br>
b. Handling categorical data</br>
c. Split the dataset into training set and test set</br>
d. Feature scaling</br>

In [16]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

### Loading Data

In [6]:
# Load the data from a local CSV file
df = pd.read_csv("C:/Users/naikn/OneDrive/Documents/python/Machine_learning/data_preprocessing.csv")

print(df)

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


### Handling Missing Data

In [8]:
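# Handling missing data

# Fill missing values in 'Age' and 'Salary' with the column mean.
# Assigning the result back avoids calling fillna(inplace=True) on a column
# selection, which pandas warns may not update the original DataFrame.
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())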

print(df)

   Country        Age        Salary Purchased
0   France  44.000000  72000.000000        No
1    Spain  27.000000  48000.000000       Yes
2  Germany  30.000000  54000.000000        No
3    Spain  38.000000  61000.000000        No
4  Germany  40.000000  63777.777778       Yes
5   France  35.000000  58000.000000       Yes
6    Spain  38.777778  52000.000000        No
7   France  48.000000  79000.000000       Yes
8  Germany  50.000000  83000.000000        No
9   France  37.000000  67000.000000       Yes
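As an alternative sketch, the same mean imputation can be done with scikit-learn's `SimpleImputer`, which stores the learned means so they can be reused on new data. The small `df_demo` frame below is a hypothetical stand-in for the dataset, not the real file.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical three-row stand-in for the dataset, with one gap per column.
df_demo = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan],
    "Salary": [72000.0, np.nan, 54000.0],
})

# strategy="mean" replaces each NaN with the mean of its own column.
imputer = SimpleImputer(strategy="mean")
df_demo[["Age", "Salary"]] = imputer.fit_transform(df_demo[["Age", "Salary"]])

print(df_demo)
print(imputer.statistics_)  # the per-column means that were learned
```

Because the imputer keeps its fitted statistics, it could be fitted on the training rows only and then applied to the test rows with `transform`.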




### Handling Categorical data

In [17]:
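# Handling categorical data

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Apply the label encoder to the 'Purchased' and 'Country' columns.
# The encoder is refit on each call, so afterwards it only holds the 'Country' mapping.
# The integer codes follow alphabetical order (France=0, Germany=1, Spain=2), not a real ranking.
df['Purchased'] = label_encoder.fit_transform(df['Purchased'])
df['Country'] = label_encoder.fit_transform(df['Country'])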


In [18]:
print(df)

   Country        Age        Salary  Purchased
0        0  44.000000  72000.000000          0
1        2  27.000000  48000.000000          1
2        1  30.000000  54000.000000          0
3        2  38.000000  61000.000000          0
4        1  40.000000  63777.777778          1
5        0  35.000000  58000.000000          1
6        2  38.777778  52000.000000          0
7        0  48.000000  79000.000000          1
8        1  50.000000  83000.000000          0
9        0  37.000000  67000.000000          1
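Integer codes for `Country` imply an ordering (0 < 1 < 2) that the countries do not actually have. A common alternative for nominal features is one-hot encoding; the sketch below uses `pd.get_dummies` on a small hypothetical frame rather than the notebook's `df`.

```python
import pandas as pd

# Hypothetical sample of the 'Country' column.
countries = pd.DataFrame({"Country": ["France", "Spain", "Germany", "Spain"]})

# One indicator column per country; exactly one of them is 1 in each row.
one_hot = pd.get_dummies(countries, columns=["Country"], dtype=int)
print(one_hot)
```

Whether one-hot or label encoding is the better choice depends on the model used later; tree-based models are often less sensitive to the implied order than linear or distance-based models.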


### Splitting Data

In [19]:
# Split the dataset into training set and test set
X = df.drop('Purchased', axis=1)  # Features (all columns except 'Purchased')
y = df['Purchased']  # Target variable (the 'Purchased' column)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Output shapes
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(8, 3) (2, 3) (8,) (2,)
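With only ten rows, a random split can easily leave the test set with a single class. As a sketch of one way to guard against that, the `stratify` argument of `train_test_split` keeps the class proportions equal in both parts. The `y_demo` values copy the encoded `Purchased` column printed earlier; `X_demo` is a hypothetical placeholder for the features.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Encoded 'Purchased' values, in row order, as printed in the output above.
y_demo = pd.Series([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
# Placeholder feature frame (hypothetical): just the row index.
X_demo = pd.DataFrame({"row": range(10)})

# stratify=y_demo keeps the 50/50 class balance in both the training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(y_tr.value_counts().to_dict(), y_te.value_counts().to_dict())
```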


### Scaling Data

In [20]:
# Apply feature scaling to 'Age' and 'Salary'

scaler = StandardScaler()

# Fit the scaler on the training data only, then apply the same transformation to the test data
X_train[['Age', 'Salary']] = scaler.fit_transform(X_train[['Age', 'Salary']])
X_test[['Age', 'Salary']] = scaler.transform(X_test[['Age', 'Salary']])

In [21]:

# Display the scaled training set
print("Scaled Training Data (X_train):")
print(X_train)

Scaled Training Data (X_train):
   Country       Age    Salary
5        0 -0.752943 -0.626038
0        0  1.008454  1.013043
7        0  1.791297  1.832583
2        1 -1.731496 -1.094347
9        0 -0.361521  0.427657
4        1  0.225611  0.050408
3        2 -0.165810 -0.274806
6        2 -0.013591 -1.328501


In [22]:
print("\nScaled Test Data (X_test):")
print(X_test)


Scaled Test Data (X_test):
   Country       Age    Salary
8        1  2.182718  2.300892
1        2 -2.318628 -1.796810
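The test values above fall outside the range of the training values because they are standardized with the training set's statistics, z = (x - mean_train) / std_train, rather than with statistics computed from the two test rows. The sketch below reproduces the `Age` column by hand. The ages are copied from the printed tables, and `349 / 9` is the imputed mean of the nine observed ages.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# 'Age' values of the eight training rows (5, 0, 7, 2, 9, 4, 3, 6) and the two test rows (8, 1).
train_age = np.array([35.0, 44.0, 48.0, 30.0, 37.0, 40.0, 38.0, 349 / 9])
test_age = np.array([50.0, 27.0])

# Manual standardization using the training mean and population standard deviation (ddof=0).
mu = train_age.mean()
sigma = train_age.std()
manual = (test_age - mu) / sigma

# The same result from StandardScaler fitted on the training values only.
scaler_demo = StandardScaler().fit(train_age.reshape(-1, 1))
from_sklearn = scaler_demo.transform(test_age.reshape(-1, 1)).ravel()

print(manual)
print(from_sklearn)
```

Fitting a separate scaler on just the two test rows would instead map them to exactly +1 and -1, which discards how far they actually lie from the training distribution.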
