<a href="https://colab.research.google.com/github/JapiKredi/Pinnacle_AI_program_AnalyticsVidyha/blob/main/Scikit_Learn_bascis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---

<center><h1>📍 📍 Basics of Scikit Learn 📍 📍</h1></center>


---

- Scikit-learn provides simple and efficient tools for preprocessing and predictive modeling.



![](images/sklearn.png)

---


***Steps to build a model in scikit-learn.***

---

1. Import the model.
2. Prepare the data set.
3. Separate the independent and target variables.
4. Create an object of the model.
5. Fit the model with the data.
6. Use the model to predict the target.
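
The six steps above can be sketched end to end on a tiny synthetic data set. This is only an illustration: the data (`y = 2x + 1`) and the choice of `LinearRegression` are assumptions made for the example, not part of the Big Mart workflow used later in this notebook.

```python
# 1. Import the model
import numpy as np
from sklearn.linear_model import LinearRegression

# 2. Prepare the data set (synthetic and noise-free: y = 2*x + 1)
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# 3. Separate the independent and target variables
X = x.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix

# 4. Create an object of the model
model = LinearRegression()

# 5. Fit the model with the data
model.fit(X, y)

# 6. Use the model to predict the target for unseen inputs
predictions = model.predict(np.array([[10.0], [11.0]]))
print(predictions)
```

Because the synthetic data is exactly linear, the predictions should be very close to 21 and 23.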

In [31]:
# import the scikit-learn library
import sklearn

***If you got an error while running the above cell, scikit-learn is probably not installed. Install it with one of the following commands.***

If you are using Anaconda with Python 3: ***`!pip install scikit-learn`***

If you are using Jupyter with Python 3: ***`!pip3 install scikit-learn`***

---

In [32]:
# check the version
sklearn.__version__

'1.6.0'

- ***We saw in the pandas notebook that our data has some missing values.***
- ***We will impute those missing values using scikit-learn's `SimpleImputer`.***

---

In [33]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [34]:
!pwd

/content


In [35]:
import pandas as pd

# Construct the full path to the file
file_path = '/content/drive/My Drive/Scikitlearn-200323-233711/Dataset/big_mart_sales.csv'

# Read the CSV file into a pandas DataFrame
data = pd.read_csv(file_path)

# Display the first few rows of the DataFrame
print(data.head())

  Item_Identifier  Item_Weight Item_Fat_Content  Item_Visibility  \
0           FDA15         9.30          Low Fat         0.016047   
1           DRC01         5.92          Regular         0.019278   
2           FDN15        17.50          Low Fat         0.016760   
3           FDX07        19.20          Regular         0.000000   
4           NCD19         8.93          Low Fat         0.000000   

               Item_Type  Item_MRP Outlet_Identifier  \
0                  Dairy  249.8092            OUT049   
1            Soft Drinks   48.2692            OUT018   
2                   Meat  141.6180            OUT049   
3  Fruits and Vegetables  182.0950            OUT010   
4              Household   53.8614            OUT013   

   Outlet_Establishment_Year Outlet_Size Outlet_Location_Type  \
0                       1999      Medium               Tier 1   
1                       2009      Medium               Tier 3   
2                       1999      Medium               Tier

In [36]:
# check each column of the data set for null values
data.isna().sum()

Item_Identifier               0
Item_Weight                1463
Item_Fat_Content              0
Item_Visibility               0
Item_Type                     0
Item_MRP                      0
Outlet_Identifier             0
Outlet_Establishment_Year     0
Outlet_Size                2410
Outlet_Location_Type          0


In [37]:
# import the SimpleImputer
from sklearn.impute import SimpleImputer

---

- To impute the missing values, we will use `SimpleImputer`.
- First, we create an imputer object and define its strategy.
- We will impute `Item_Weight` with the `mean` value and `Outlet_Size` with the `most_frequent` value.
- Fit the imputer objects on the data.
- Transform the data.

---
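
To make the difference between fitting and transforming concrete, here is a minimal sketch on a toy column. The DataFrame `toy` and the extra rows in `new_rows` are made up for illustration; `statistics_` is the attribute where a fitted `SimpleImputer` stores the value it learned for each column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with one missing value (illustrative data, not the full Big Mart file)
toy = pd.DataFrame({"Item_Weight": [9.3, 5.92, np.nan, 19.2]})

# fit() only learns the statistic (here, the mean of the non-missing values)
imputer = SimpleImputer(strategy="mean")
imputer.fit(toy[["Item_Weight"]])
print(imputer.statistics_)

# transform() fills missing entries with the learned statistic,
# and can be applied to rows the imputer has never seen
new_rows = pd.DataFrame({"Item_Weight": [np.nan, 8.93]})
filled = imputer.transform(new_rows[["Item_Weight"]]).ravel()  # 2-D -> 1-D
print(filled)
```

Keeping the two steps separate means the statistic learned from one data set can be reused unchanged on later data.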

In [38]:
# create the object of the imputer for Item_Weight and Outlet_Size
impute_weight = SimpleImputer(strategy='mean')
impute_size   = SimpleImputer(strategy='most_frequent')

In [39]:
# fit the Item_Weight imputer on the data and transform
# (transform returns a 2-D array, so flatten it before assigning it to the column)
impute_weight.fit(data[['Item_Weight']])
data['Item_Weight'] = impute_weight.transform(data[['Item_Weight']]).ravel()

In [40]:
# fit the Outlet_Size imputer on the data and transform
# (flatten the 2-D result with ravel() before assigning it back to the column)
impute_size.fit(data[['Outlet_Size']])
data['Outlet_Size'] = impute_size.transform(data[['Outlet_Size']]).ravel()

In [41]:
from sklearn.impute import SimpleImputer
import pandas as pd
import numpy as np

# Small sample DataFrame with a few of the same columns (this reassigns `data`)
data = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'DRC01', 'FDN15', 'FDX07', 'NCD19', 'FDP36'],
    'Item_Weight': [9.3, 5.92, 17.5, 19.2, 8.93, np.nan],
    'Item_Fat_Content': ['Low Fat', 'Regular', 'Low Fat', 'Regular', 'Low Fat', 'Regular'],
    'Outlet_Size': ['Medium', 'Small', None, 'High', 'Medium', None]
})

# Replace None with np.nan
data['Outlet_Size'] = data['Outlet_Size'].replace({None: np.nan})

# Create the imputer
impute_size = SimpleImputer(strategy='most_frequent')

# Fit the imputer with the data and transform only the Outlet_Size column
data['Outlet_Size'] = impute_size.fit_transform(data[['Outlet_Size']]).ravel()

print(data)

  Item_Identifier  Item_Weight Item_Fat_Content Outlet_Size
0           FDA15         9.30          Low Fat      Medium
1           DRC01         5.92          Regular       Small
2           FDN15        17.50          Low Fat      Medium
3           FDX07        19.20          Regular        High
4           NCD19         8.93          Low Fat      Medium
5           FDP36          NaN          Regular      Medium


In [42]:
data.head()

  Item_Identifier  Item_Weight Item_Fat_Content Outlet_Size
0           FDA15         9.30          Low Fat      Medium
1           DRC01         5.92          Regular       Small
2           FDN15        17.50          Low Fat      Medium
3           FDX07        19.20          Regular        High
4           NCD19         8.93          Low Fat      Medium


In [43]:
# check the null values.
data.isna().sum()

Item_Identifier     0
Item_Weight         1
Item_Fat_Content    0
Outlet_Size         0


- ***Now, after the preprocessing step, we separate the independent and target variables and pass the data to the model object to train the model.***
---
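
As a rough sketch of that separation, the snippet below splits a toy DataFrame into a feature matrix `X` and a target `y`. The column name `Sales` and its values are hypothetical placeholders chosen for the example; substitute the actual target column of your data set.

```python
import pandas as pd

# Toy data: the two feature columns reuse values shown earlier in this notebook,
# while "Sales" is a hypothetical target with made-up values
toy = pd.DataFrame({
    "Item_Weight": [9.3, 5.92, 17.5, 19.2],
    "Item_MRP": [249.8092, 48.2692, 141.6180, 182.0950],
    "Sales": [100.0, 20.0, 60.0, 75.0],
})

target_column = "Sales"  # hypothetical name, an assumption for this sketch

# Independent variables: every column except the target
X = toy.drop(columns=[target_column])

# Target variable: the single column we want to predict
y = toy[target_column]

print(X.shape, y.shape)
```

`X` stays two-dimensional (rows by features) and `y` is one-dimensional, which is the shape scikit-learn estimators expect in `fit(X, y)`.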

- ***If we have to identify the category of an object based on some features, for example whether a given picture shows a cat or a dog, we have a `classification problem`.***
- ***If we have to predict a continuous attribute, such as sales, based on some features, we have a `regression problem`.***

---

***`scikit-learn` provides tools for building regression models, classification models, and many others.***

---

In [44]:
# two of the basic models that scikit-learn provides
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

---
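
To illustrate the difference between the two problem types, the sketch below fits both models on the same toy feature. All data here is invented for the example: the classifier learns string labels (`cat` or `dog`), while the regressor learns a continuous value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# One toy feature, six samples (illustrative values only)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Classification: the target is a category
y_class = np.array(["cat", "cat", "cat", "dog", "dog", "dog"])
clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)
class_pred = clf.predict(np.array([[1.5], [5.5]]))

# Regression: the target is a continuous number (here exactly 10 * x)
y_reg = 10.0 * X.ravel()
reg = LinearRegression().fit(X, y_reg)
reg_pred = reg.predict(np.array([[1.5], [5.5]]))

print(class_pred)  # predicted categories
print(reg_pred)    # predicted continuous values
```

Both estimators share the same `fit` and `predict` interface; only the type of the target changes.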

After we have built the model, whenever new data points are added to the existing data, we need to perform the same preprocessing steps again before we can use the model to make predictions. This becomes a tedious and time-consuming process!

So, scikit-learn provides tools to create a pipeline of all those steps, which makes your work a lot easier.

---

In [45]:
from sklearn.pipeline import Pipeline

___
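
As a minimal sketch of the idea, the pipeline below chains a mean imputer and a linear regression so that new data is preprocessed automatically at prediction time. The step names (`impute`, `model`) and the toy arrays are assumptions made for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy training data with one missing feature value (illustrative only)
X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_train = np.array([2.0, 4.0, 5.0, 8.0])

# Each step is a (name, estimator) pair; the last step is the model
pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("model", LinearRegression()),
])

# fit() runs the imputer's fit/transform, then fits the model on the result
pipe.fit(X_train, y_train)

# predict() re-applies the fitted imputer before calling the model,
# so new rows with missing values need no manual preprocessing
X_new = np.array([[3.0], [np.nan]])
preds = pipe.predict(X_new)
print(preds)
```

The fitted steps remain accessible through `pipe.named_steps`, for example `pipe.named_steps["impute"].statistics_` for the learned mean.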

***We will study each of these steps in detail in the upcoming modules.***

---

***Learn more about scikit-learn here: https://scikit-learn.org/stable/index.html***

---