$Step-1$:
    
**Import required packages**

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')  # suppress warning messages in the output

$Step-2$

**Read the data**

In [3]:
# read the dataset 

df = pd.read_csv("winequality_red.csv")
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


- Check for empty (blank) rows when reading the file. For a CSV file, first open it in a spreadsheet or text editor and inspect it.
- Check whether any empty columns exist. Empty rows and columns are not the same as missing values.
- Empty rows sometimes appear while reading a file; drop them.
- This step is optional and applies only if empty rows exist:
    - df=df.dropna()
    - df.reset_index(drop=True, inplace=True)  (resets the index after dropping the empty rows)
    - df
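A minimal sketch of the difference between blank rows and genuine missing values, using a small made-up frame (the `toy` data is hypothetical, not taken from the wine file). Note that plain `dropna()` removes every row containing any NaN, while `dropna(how="all")` removes only rows that are completely empty, which may be closer to what the notes above intend.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 is entirely empty, row 2 has one genuinely missing value
toy = pd.DataFrame({
    "pH": [3.51, np.nan, 3.26, 3.16],
    "alcohol": [9.4, np.nan, np.nan, 9.8],
})

# how="all" drops only rows in which every column is NaN (the blank rows)
cleaned = toy.dropna(how="all").reset_index(drop=True)

print(cleaned.shape)           # rows and columns left after removing blank rows
print(cleaned.isnull().sum())  # real missing values are still visible per column
```

With `drop=True`, the old index is discarded instead of being added as a new column.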

In [4]:
df=df.dropna()
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [5]:
df.shape

(1599, 12)

In [6]:
df.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

In [7]:
df.isnull().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

In [8]:
df.dtypes

fixed acidity           float64
volatile acidity        float64
citric acid             float64
residual sugar          float64
chlorides               float64
free sulfur dioxide     float64
total sulfur dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                   int64
dtype: object

- We divide the data into two parts: input data and output data.

- input data = X; output data = y

- We then divide the input data into two parts: train and test.

- input train data = X_train; input test data = X_test

- Similarly, we divide the output data into two parts: train and test.

- output train data = y_train; output test data = y_test

- The model is developed (fitted) on the train data, i.e. X_train and y_train.

- The fitted model predicts from X_test; these predictions are called y_predictions.

- Comparing y_predictions with y_test gives the test accuracy / test error.

In [9]:
#x_train   y_train
#1           1
#2           4
#3           9
#4           16

#x_test    y_test
#5         25

#develop a model on (1,1) (2,4) (3,9) (4,16)
#the model predicts for input 5; we compare y_predictions with y_test (25)
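The toy numbers in the cell above can be run end to end. This is only a sketch of the train/predict/compare workflow; the variable names (`toy_model`, `test_error`) are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Train data: (1,1) (2,4) (3,9) (4,16); test data: (5,25)
x_train = np.array([[1], [2], [3], [4]])   # 2-D: one column, four rows
y_train = np.array([1, 4, 9, 16])
x_test = np.array([[5]])
y_test = np.array([25])

# The model is developed on the train data only
toy_model = LinearRegression().fit(x_train, y_train)

# The model predicts from x_test; the prediction is then compared with y_test
y_pred = toy_model.predict(x_test)
test_error = y_test - y_pred
print(y_pred, test_error)
```

Because the toy data follows a curve (y = x squared) and the model is a straight line, the prediction for 5 falls below 25. That gap is the test error.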

$Step-3$

**Divide data into input and output data**

In [10]:
df.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

In [11]:
X=df.drop('quality',axis=1)
y=df['quality']

$Step-4$

**Divide data into train and test**

- We use train_test_split from sklearn.model_selection

- It takes the following parameters

    - X: input data
    
    - y: output data
    
    - test_size = 0.3: 30% test data, 70% train data
    
    - if test_size is not given, the default split is 75:25
    
    - random_state
    
        - we want to select observations randomly
        
        - many different selections are possible
        
        - each random_state value (a seed) reproduces one particular selection
        
        - example: the numbers 1,2,3,4,5,6,7,8,9,10
        
        - selecting 5 of these numbers at random can be done in 252 ways
        
        - random_state=42 always gives the same selection
        
        - random_state=1234 also gives a fixed selection, generally a different one
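The idea behind random_state can be sketched with Python's standard library. This uses the `random` module as a stand-in (an assumption for illustration; scikit-learn uses NumPy's generator internally), but the principle is the same: the same seed reproduces the same selection.

```python
import math
import random

numbers = list(range(1, 11))            # 1,2,...,10

# Number of distinct ways to choose 5 of the 10 numbers
n_ways = math.comb(len(numbers), 5)

# The same seed always reproduces the same selection
pick_a = random.Random(42).sample(numbers, 5)
pick_b = random.Random(42).sample(numbers, 5)

# A different seed gives its own fixed selection
pick_c = random.Random(1234).sample(numbers, 5)

print(n_ways)
print(pick_a == pick_b)
print(pick_a, pick_c)
```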

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(X,  # Input data
                                                  y,  # output data
                                                  random_state=1234, # seed: makes the random split reproducible
                                                  test_size=0.30)


**Check point-1**

$Check$ $the$ $shape$

In [13]:
#df.shape
#X: 11 columns   y: 1 column
#1599 rows = 100%
#?    rows = 30%

#1599*30/100 = 479.7, rounded up to 480

#test data has  480 rows
#train data has 1599-480=1119 rows

#df      : 12 columns  1599 rows   (1599,12)
#X_train : 11 columns  1119 rows   (1119,11)
#X_test  : 11 columns   480 rows   (480,11)
#y_train :  1  column  1119 rows   (1119,)
#y_test  :  1  column   480 rows   (480,)


# a shape of (rows,columns), e.g. (480,1), is a DataFrame
# a shape of (rows,),        e.g. (480,),  is a Series
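The row counts above can be checked with a few lines of arithmetic. This sketch assumes the test count is rounded up, which is consistent with the 480 test rows shown in the shapes.

```python
import math

n_rows = 1599       # rows in df
test_size = 0.30    # fraction of rows reserved for testing

# 1599 * 0.30 is about 479.7, which is rounded up to a whole number of rows
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

print(n_test, n_train)
```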

In [14]:
X_train.shape, X_test.shape

((1119, 11), (480, 11))

In [15]:
y_train.shape, y_test.shape

((1119,), (480,))

In [16]:
print("The shape of the data frame is:",df.shape)
print("The shape of X_train is:",X_train.shape)
print("The shape of X_test is:",X_test.shape)
print("The shape of y_train is:",y_train.shape)
print("The shape of y_test is:",y_test.shape)

The shape of the data frame is: (1599, 12)
The shape of X_train is: (1119, 11)
The shape of X_test is: (480, 11)
The shape of y_train is: (1119,)
The shape of y_test is: (480,)


**Check point-2**:
    
- Check that the row indexes of X_train and y_train match
    
- Similarly, the row indexes of X_test and y_test should match
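Instead of comparing the printed indexes by eye, the check can be done in code. A minimal sketch on a small made-up frame (the `toy` values and the names `train_ok`, `test_ok`, `no_overlap` are illustrative only):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with one input column and the output column
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4, 10.5, 11.2, 11.0, 10.2, 9.5],
    "quality": [5, 5, 5, 6, 5, 5, 6, 6, 5, 5],
})
X_toy = toy.drop("quality", axis=1)
y_toy = toy["quality"]

Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy,
                                      random_state=1234,
                                      test_size=0.30)

# Inputs and outputs must refer to the same rows, in the same order
train_ok = Xtr.index.equals(ytr.index)
test_ok = Xte.index.equals(yte.index)

# No row should appear in both train and test
no_overlap = Xtr.index.intersection(Xte.index).empty

print(train_ok, test_ok, no_overlap)
```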

In [17]:
X_train  

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
642,9.9,0.540,0.45,2.3,0.071,16.0,40.0,0.99910,3.39,0.62,9.4
678,8.3,0.780,0.10,2.6,0.081,45.0,87.0,0.99830,3.48,0.53,10.0
412,7.1,0.735,0.16,1.9,0.100,15.0,77.0,0.99660,3.27,0.64,9.3
73,8.3,0.675,0.26,2.1,0.084,11.0,43.0,0.99760,3.31,0.53,9.2
985,7.4,0.580,0.00,2.0,0.064,7.0,11.0,0.99562,3.45,0.58,11.3
...,...,...,...,...,...,...,...,...,...,...,...
1228,5.1,0.420,0.00,1.8,0.044,18.0,88.0,0.99157,3.68,0.73,13.6
1077,8.6,0.370,0.65,6.4,0.080,3.0,8.0,0.99817,3.27,0.58,11.0
1318,7.5,0.630,0.27,2.0,0.083,17.0,91.0,0.99616,3.26,0.58,9.8
723,7.1,0.310,0.30,2.2,0.053,36.0,127.0,0.99650,2.94,1.62,9.5


In [18]:
y_train

642     5
678     5
412     5
73      4
985     6
       ..
1228    7
1077    5
1318    6
723     5
815     5
Name: quality, Length: 1119, dtype: int64

In [19]:
X_test

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
688,7.7,0.660,0.04,1.6,0.039,4.0,9.0,0.99620,3.40,0.47,9.4
961,7.1,0.560,0.14,1.6,0.078,7.0,18.0,0.99592,3.27,0.62,9.3
726,8.1,0.720,0.09,2.8,0.084,18.0,49.0,0.99940,3.43,0.72,11.1
537,8.1,0.825,0.24,2.1,0.084,5.0,13.0,0.99720,3.37,0.77,10.7
1544,8.4,0.370,0.43,2.3,0.063,12.0,19.0,0.99550,3.17,0.81,11.2
...,...,...,...,...,...,...,...,...,...,...,...
1461,6.2,0.785,0.00,2.1,0.060,6.0,13.0,0.99664,3.59,0.61,10.0
1591,5.4,0.740,0.09,1.7,0.089,16.0,26.0,0.99402,3.67,0.56,11.6
1045,6.9,0.440,0.00,1.4,0.070,32.0,38.0,0.99438,3.32,0.58,11.4
1498,6.6,0.895,0.04,2.3,0.068,7.0,13.0,0.99582,3.53,0.58,10.8


In [20]:
y_test

688     5
961     5
726     6
537     6
1544    7
       ..
1461    4
1591    6
1045    6
1498    6
42      6
Name: quality, Length: 480, dtype: int64

In [21]:
df.reset_index(drop=True)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5


In [22]:
############## All together ##############
# Step-1: Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Step-2: Read the data
# (point this to wherever the file is stored; here it is in the working directory)
df = pd.read_csv("winequality_red.csv")
df.head()

# Step-3 (optional): if you see duplicates, or the data shows empty rows
#         (displayed as NULL or NaN) when you read it,
#         check whether these are real missing values
#         or just empty rows (e.g. every alternate row is blank)
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)  # drop=True avoids adding the old index as an input column

##################### EDA ##################
# Numerical data should be ready before going to step-4
#############################################

# Step-4: Divide into X and y
X=df.drop('quality',axis=1)
y=df['quality']

# Step-5: Train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(X,  # Input data
                                                  y,  # output data
                                                  random_state=1234, # seed: makes the random split reproducible
                                                  test_size=0.30)

$Step-5$

**Model development**

In [23]:
# Model development happens using train data
# X_train    y_train
#from sklearn.linear_model import LinearRegression
#LR=LinearRegression()




In [24]:
X_train.ndim
# ndim is the number of dimensions (axes), not the number of columns
# 1 dimension: a single column (Series / 1-D array); the shape shows only the rows
# 2 dimensions: a table of rows and columns (DataFrame / 2-D array)
# (21,) is one column of data with 21 observations
# (9,) is one column of data with 9 observations
# (30,2) is 2 columns of data with 30 observations
# scikit-learn expects 2-D input, so reshape the data if you have only one column

2
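A short sketch of the reshape mentioned in the comments above, using a made-up one-column array:

```python
import numpy as np

# A single column stored as a 1-D array: the shape shows rows only
one_col = np.array([1.0, 2.0, 3.0])
print(one_col.ndim, one_col.shape)

# reshape(-1, 1) turns it into a 2-D table with one column;
# -1 lets NumPy work out the number of rows
as_table = one_col.reshape(-1, 1)
print(as_table.ndim, as_table.shape)
```

X_train here already has 2 dimensions, so no reshape is needed for the wine data.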

- import the class

- create an object of the class and save it in a variable

- apply fit (for models) or fit_transform (for preprocessing classes)

The following are a few of the classes and modules used for different ML algorithms.

- from sklearn.preprocessing import LabelEncoder,StandardScaler,MinMaxScaler
- from sklearn.model_selection import train_test_split
- from sklearn.linear_model import LinearRegression
- from sklearn.tree import DecisionTreeClassifier
- from sklearn.naive_bayes
- from sklearn.neighbors
- from sklearn.ensemble
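The three steps (import, create the object, fit) look the same for most scikit-learn classes. A minimal sketch with MinMaxScaler on made-up values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler        # 1. import the class

scaler = MinMaxScaler()                                # 2. create the object and save it

# Hypothetical single-column data, already 2-D (rows, columns)
values = np.array([[9.0], [10.0], [11.0], [13.0]])

scaled = scaler.fit_transform(values)                  # 3. fit and transform in one call
print(scaled.ravel())
```

Models such as LinearRegression follow the same pattern, with `fit` followed by `predict` instead of `fit_transform`.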


In [25]:
from sklearn.linear_model import LinearRegression
LR=LinearRegression()
LR.fit(X_train,y_train)

$Step-6$

**Model predictions**

We pass the X_test data to the fitted model.

In [26]:
# Model predictions are made on X_test
y_predictions=LR.predict(X_test)

In [27]:
y_predictions

array([5.15272674, 5.32546327, 5.63662519, 5.51245819, 6.35588496,
       5.39028618, 5.37580837, 5.94379165, 5.49446987, 6.98211458,
       4.92847919, 5.18177228, 5.73271386, 5.56085417, 5.53985028,
       5.25272042, 4.93424659, 5.12423646, 6.27626835, 5.12589226,
       5.06691612, 5.91039778, 5.1797551 , 5.45854364, 5.72531378,
       6.31002431, 5.30913419, 5.50678991, 6.04808131, 5.47609822,
       4.72291628, 6.20818055, 5.21920961, 5.62421184, 5.55062526,
       5.52259916, 6.36735043, 5.23288886, 5.49938736, 5.80792898,
       5.16587916, 6.13318292, 5.72636822, 5.33537114, 5.63701918,
       4.96463531, 4.88029661, 5.78938553, 6.36333952, 5.56149953,
       4.95005899, 5.75085354, 6.67236186, 6.16951277, 5.34056761,
       5.12096361, 6.19641543, 5.21584694, 6.46449078, 5.25590493,
       5.27667097, 5.17871421, 5.09122333, 5.63083824, 5.01946398,
       6.16336001, 5.7427919 , 6.337122  , 5.28732968, 6.05063646,
       5.7342916 , 5.57340483, 5.21043842, 6.18633591, 5.50678

$Step-7$

**Model evaluation**

In [28]:
# RMSE
# MSE
# MAE
# R-square

from sklearn.metrics import r2_score,mean_squared_error

In [29]:
R2=r2_score(y_test,y_predictions)
MSE=mean_squared_error(y_test,y_predictions)
#MSE**(1/2)
RMSE=np.sqrt(MSE)
#accuracy_score(y_test,y_predictions) # not applicable: this is a regression technique
print("R-square:",R2)
print("MSE:",MSE)
print("RMSE:",RMSE)

R-square: 0.3128586279075245
MSE: 0.39185847231164256
RMSE: 0.6259860000923684
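The metrics listed above (MAE is named but not computed in the cell) can be written out by hand to show what each one measures. This sketch uses four made-up values, not the wine predictions.

```python
import math

# Hypothetical actual and predicted quality values
y_true = [5, 6, 7, 5]
y_pred = [5.5, 5.5, 6.0, 5.0]
n = len(y_true)

errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n      # mean absolute error
mse = sum(e * e for e in errors) / n       # mean squared error
rmse = math.sqrt(mse)                      # root mean squared error

# R-square: 1 - (residual sum of squares / total sum of squares)
mean_true = sum(y_true) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```

RMSE and MAE are in the same units as the output (quality points), which makes them easier to interpret than MSE.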


In [30]:
# RMSE is about 0.63 quality points
# Suppose the actual quality of a wine is 6
# The model's prediction will typically be off by about 0.63, i.e. roughly 5.4 or 6.6

$Step-8$

**Finding coefficients and intercept**

- The intercept is b0; the coefficients are b1, b2, ...

- The number of coefficients depends on the number of input features

- In this data we have 11 input columns (fixed acidity through alcohol)

- So we will get 11 coefficients, one per input column

In [31]:
LR.coef_
print("The coefficients of the 11 input features are:",LR.coef_)

The coefficients of the 11 input features are: [ 2.86163946e-02 -1.17907625e+00 -2.55897983e-01  4.14316493e-03
 -1.54784393e+00  6.49596171e-03 -3.95580696e-03 -1.20899746e+01
 -4.30288140e-01  9.60801101e-01  2.93739795e-01]


In [32]:
LR.intercept_

16.07844379344572

In [33]:
X_train.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol'],
      dtype='object')

In [34]:
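#Regression_equation=LR.intercept_+LR.coef_ * column values
#Regression_equation

# quality = 16.0784
#           + 0.0286*fixed acidity - 1.1791*volatile acidity - 0.2559*citric acid
#           + 0.0041*residual sugar - 1.5478*chlorides + 0.0065*free sulfur dioxide
#           - 0.0040*total sulfur dioxide - 12.0900*density - 0.4303*pH
#           + 0.9608*sulphates + 0.2937*alcohol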
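The regression equation is the intercept plus each coefficient multiplied by the value of its column. A minimal sketch that applies it by hand, using the intercept and coefficients printed above and the first row of X_test (index 688); the variable names are illustrative.

```python
# Intercept and coefficients as printed by LR.intercept_ and LR.coef_
intercept = 16.07844379344572
coefs = [2.86163946e-02, -1.17907625e+00, -2.55897983e-01, 4.14316493e-03,
         -1.54784393e+00, 6.49596171e-03, -3.95580696e-03, -1.20899746e+01,
         -4.30288140e-01, 9.60801101e-01, 2.93739795e-01]

# First row of X_test (index 688), in the same column order as X_train.columns
row_688 = [7.7, 0.66, 0.04, 1.6, 0.039, 4.0, 9.0, 0.9962, 3.40, 0.47, 9.4]

# quality = b0 + b1*x1 + b2*x2 + ... + b11*x11
predicted_quality = intercept + sum(b * x for b, x in zip(coefs, row_688))
print(round(predicted_quality, 4))
```

The result agrees with the first value of y_predictions (5.15272674), up to rounding of the printed coefficients.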

$Step-9$

**Plot the regression line**

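- To plot a regression line

- we need to understand two plots

- Original data plot, i.e. input data (X) vs output data (y)

- Regression plot, i.e. input data (X) vs the model's predictions for that input data (X)

- A single regression line can be drawn only when there is one input column; the wine data has 11, so the code below is left commented out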

In [35]:
# Draw the regression line on original data vs predictions on original data

#original_y_predictions=LR.predict(X.array.reshape(-1,1))
#plt.scatter(X,y,label='original data')  # Original plot
#plt.plot(X,original_y_predictions,color='red') # Regression plot

$Step-10$

**statsmodels OLS method**

In [36]:
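from statsmodels.api import OLS
# Note: OLS does not add an intercept automatically, so this fit has no constant term
# and the summary reports the uncentered R-squared (not comparable with r2_score above)
OLS(y_train,X_train).fit().summary()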

Dep. Variable:,quality,R-squared (uncentered):,0.987
Model:,OLS,Adj. R-squared (uncentered):,0.987
Method:,Least Squares,F-statistic:,7512.0
Date:,"Tue, 26 Mar 2024",Prob (F-statistic):,0.0
Time:,08:56:30,Log-Likelihood:,-1115.2
No. Observations:,1119,AIC:,2252.0
Df Residuals:,1108,BIC:,2308.0
Df Model:,11,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
fixed acidity,0.0133,0.020,0.664,0.507,-0.026,0.053
volatile acidity,-1.1899,0.142,-8.354,0.000,-1.469,-0.910
citric acid,-0.2557,0.178,-1.434,0.152,-0.606,0.094
residual sugar,-0.0025,0.014,-0.178,0.859,-0.031,0.025
chlorides,-1.5735,0.494,-3.182,0.002,-2.544,-0.603
free sulfur dioxide,0.0067,0.003,2.530,0.012,0.001,0.012
total sulfur dioxide,-0.0040,0.001,-4.260,0.000,-0.006,-0.002
density,4.3073,0.768,5.606,0.000,2.800,5.815
pH,-0.5074,0.196,-2.587,0.010,-0.892,-0.123

Omnibus:,13.594,Durbin-Watson:,2.099
Prob(Omnibus):,0.001,Jarque-Bera (JB):,16.983
Skew:,-0.165,Prob(JB):,0.000205
Kurtosis:,3.506,Cond. No.,2410.0
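The summary shows "R-squared (uncentered)" of 0.987, far above the R-square of about 0.31 from scikit-learn. One likely reason is that the OLS call has no intercept column, so R-squared is measured against zero rather than against the mean of y. The sketch below illustrates the effect with NumPy on synthetic data (all values are made up); in statsmodels, a constant column is usually added with `statsmodels.api.add_constant`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic data: the feature is close to 1 (like density) and unrelated to y
x = rng.uniform(0.99, 1.01, size=n)
y = 5.6 + rng.normal(0.0, 0.8, size=n)

# Fit WITHOUT an intercept column
X_no_const = x.reshape(-1, 1)
beta, *_ = np.linalg.lstsq(X_no_const, y, rcond=None)
resid = y - X_no_const @ beta
# Uncentered R-squared compares residuals with sum(y^2)
r2_uncentered = 1 - np.sum(resid ** 2) / np.sum(y ** 2)

# Fit WITH an intercept column of ones
X_const = np.column_stack([np.ones(n), x])
beta_c, *_ = np.linalg.lstsq(X_const, y, rcond=None)
resid_c = y - X_const @ beta_c
# Centered R-squared compares residuals with the spread of y around its mean
r2_centered = 1 - np.sum(resid_c ** 2) / np.sum((y - y.mean()) ** 2)

print(r2_uncentered, r2_centered)
```

The uncentered value is close to 1 even though the feature explains almost nothing, so the two kinds of R-squared should not be compared directly.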


In [37]:
## All together

################################## Data into two parts ############################################
X=df.drop('quality',axis=1)
y=df['quality']


################################ Train test split #################################################
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(X,  # Input data
                                                  y,  # output data
                                                  random_state=1234, # seed: makes the random split reproducible
                                                  test_size=0.30)

######################### Model development on X_train and y_train ##################################
from sklearn.linear_model import LinearRegression
LR=LinearRegression()
LR.fit(X_train,y_train)

######################### Model predictions on X_test ###############################################
y_predictions=LR.predict(X_test)


######################### Metrics ###################################################################

from sklearn.metrics import r2_score,mean_squared_error
R2=r2_score(y_test,y_predictions)
MSE=mean_squared_error(y_test,y_predictions)
RMSE=np.sqrt(MSE)
#accuracy_score(y_test,y_predictions) # not applicable: this is a regression technique
print("R-square:",R2)
print("MSE:",MSE)
print("RMSE:",RMSE)

$Step-11$:
    
**Save the model**

In [38]:
import pickle
with open('linear_wine_model.pkl','wb') as f:
    pickle.dump(LR, f)

# Model object: LR
# File name the model is saved under: linear_wine_model
# Extension: .pkl (pickle)
# wb: write in binary mode

$Step-12$:

**Load the model**

In [39]:
# Load the model to compare the results
with open('linear_wine_model.pkl','rb') as f:
    model = pickle.load(f)
model
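To confirm that a saved model behaves like the original, the predictions before saving and after loading can be compared. A minimal sketch with a small made-up model and a temporary file (the names `fitted`, `restored` and `toy_model.pkl` are illustrative):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy model, fitted on made-up data
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([2.0, 4.0, 6.0, 8.0])
fitted = LinearRegression().fit(X_toy, y_toy)

# Save to a temporary folder; "with" closes the file automatically
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(path, "wb") as f:
    pickle.dump(fitted, f)

# Load it back and compare predictions with the original model
with open(path, "rb") as f:
    restored = pickle.load(f)

same = np.allclose(fitted.predict(X_toy), restored.predict(X_toy))
print(same)
```

Only load pickle files from sources you trust, since loading a pickle can execute arbitrary code.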

$Step-13$:
    
**Predictions**

In [40]:
X_test

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
688,7.7,0.660,0.04,1.6,0.039,4.0,9.0,0.99620,3.40,0.47,9.4
961,7.1,0.560,0.14,1.6,0.078,7.0,18.0,0.99592,3.27,0.62,9.3
726,8.1,0.720,0.09,2.8,0.084,18.0,49.0,0.99940,3.43,0.72,11.1
537,8.1,0.825,0.24,2.1,0.084,5.0,13.0,0.99720,3.37,0.77,10.7
1544,8.4,0.370,0.43,2.3,0.063,12.0,19.0,0.99550,3.17,0.81,11.2
...,...,...,...,...,...,...,...,...,...,...,...
1461,6.2,0.785,0.00,2.1,0.060,6.0,13.0,0.99664,3.59,0.61,10.0
1591,5.4,0.740,0.09,1.7,0.089,16.0,26.0,0.99402,3.67,0.56,11.6
1045,6.9,0.440,0.00,1.4,0.070,32.0,38.0,0.99438,3.32,0.58,11.4
1498,6.6,0.895,0.04,2.3,0.068,7.0,13.0,0.99582,3.53,0.58,10.8


In [41]:
X_test.values

array([[ 7.7  ,  0.66 ,  0.04 , ...,  3.4  ,  0.47 ,  9.4  ],
       [ 7.1  ,  0.56 ,  0.14 , ...,  3.27 ,  0.62 ,  9.3  ],
       [ 8.1  ,  0.72 ,  0.09 , ...,  3.43 ,  0.72 , 11.1  ],
       ...,
       [ 6.9  ,  0.44 ,  0.   , ...,  3.32 ,  0.58 , 11.4  ],
       [ 6.6  ,  0.895,  0.04 , ...,  3.53 ,  0.58 , 10.8  ],
       [ 7.5  ,  0.49 ,  0.2  , ...,  3.21 ,  0.9  , 10.5  ]])

In [42]:
len(X_test.columns)

11

In [43]:
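model.predict([[1,2,3,4,5,6,7,8,9,10,11]])
# the model has 11 input columns,
# so we need to pass 11 values as a list (inside another list: one row)
# these values are arbitrary and far outside the data range, hence the unrealistic prediction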

array([-82.48338949])

In [44]:
model.predict(X_test)

array([5.15272674, 5.32546327, 5.63662519, 5.51245819, 6.35588496,
       5.39028618, 5.37580837, 5.94379165, 5.49446987, 6.98211458,
       4.92847919, 5.18177228, 5.73271386, 5.56085417, 5.53985028,
       5.25272042, 4.93424659, 5.12423646, 6.27626835, 5.12589226,
       5.06691612, 5.91039778, 5.1797551 , 5.45854364, 5.72531378,
       6.31002431, 5.30913419, 5.50678991, 6.04808131, 5.47609822,
       4.72291628, 6.20818055, 5.21920961, 5.62421184, 5.55062526,
       5.52259916, 6.36735043, 5.23288886, 5.49938736, 5.80792898,
       5.16587916, 6.13318292, 5.72636822, 5.33537114, 5.63701918,
       4.96463531, 4.88029661, 5.78938553, 6.36333952, 5.56149953,
       4.95005899, 5.75085354, 6.67236186, 6.16951277, 5.34056761,
       5.12096361, 6.19641543, 5.21584694, 6.46449078, 5.25590493,
       5.27667097, 5.17871421, 5.09122333, 5.63083824, 5.01946398,
       6.16336001, 5.7427919 , 6.337122  , 5.28732968, 6.05063646,
       5.7342916 , 5.57340483, 5.21043842, 6.18633591, 5.50678

In [45]:
import os
os.getcwd()

'/Users/jagathyrav/Naresh IT/Data Science/Anaconda JN/EDA'

In [46]:
X.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol'],
      dtype='object')

In [None]:
input_cols=['fixed acidity', 'volatile acidity', 'citric acid', 
                'residual sugar','chlorides', 'free sulfur dioxide', 
                'total sulfur dioxide', 'density',
                'pH', 'sulphates', 'alcohol']

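# Manual alternative: val1=input(), val2=input(), ..., val11=input()
# and then list1=[val1,val2,...,val11]

# Read one value per input column from the user
list1=[]
for col in input_cols:
    val=input(f"Enter {col}: ")
    list1.append(float(val))  # float() is safer than eval() for user input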

In [None]:
list1

In [None]:
model.predict([list1])
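User input arrives as text, so it helps to convert and check it before calling the model. A minimal sketch with a hypothetical helper `parse_features` (not part of the notebook), using the values of X_test row 688 as example text input:

```python
def parse_features(raw_values, expected=11):
    """Convert text inputs to floats and check that the count matches the model's inputs."""
    if len(raw_values) != expected:
        raise ValueError(f"expected {expected} values, got {len(raw_values)}")
    # float() raises ValueError for text that is not a number
    return [float(v) for v in raw_values]

# Example: the 11 values of X_test row 688, as they would be typed by a user
raw = ["7.7", "0.66", "0.04", "1.6", "0.039", "4.0",
       "9.0", "0.9962", "3.40", "0.47", "9.4"]
row = parse_features(raw)
print(row)

# With the loaded model available, the prediction would then be:
# model.predict([row])
```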