# Regression and Linear Models
* Linear Regression
* Generalized Linear Models
* Generalized Estimating Equations
* Generalized Additive Models (GAM)
* Robust Linear Models
* Linear Mixed Effects Models
* Regression with Discrete Dependent Variable
* Generalized Linear Mixed Effects Models
* ANOVA
* Other Models (othermod)


# Time Series Analysis
* Time Series Analysis (tsa)
* Time Series Analysis by State Space Methods (statespace)
* Vector Autoregressions (tsa.vector_ar)


# Other Models
* Methods for Survival and Duration Analysis
* Nonparametric Methods (nonparametric)
* Generalized Method of Moments (gmm)
* Other Models (miscmodels)
* Multivariate Statistics (multivariate)

Statsmodels uses the following names for the model variables:

* endog : the dependent (endogenous) variable, y
* exog : the independent (exogenous) variables, x

# Statsmodels Tools 


* Regression and Linear Models


* Time Series Analysis


* Statistics and Tools

In [1]:
# import statsmodels and the other required libraries

import pandas as pd

import statsmodels

import statsmodels.api as sm  # array-based API: pass endog (y) and exog (x) directly

import statsmodels.formula.api as smf  # formula API: models written as R-style formula strings

from sklearn.model_selection import train_test_split

import warnings   # used to suppress warnings and keep the output short
warnings.filterwarnings("ignore")

In [2]:
# check the version

statsmodels.__version__

'0.13.2'

In [3]:
df = pd.read_csv(r'C:\Users\HP\Desktop\Data Science\Data Sets For Practice\carprices1.csv')
df

Unnamed: 0,Mileage,Age(yrs),Sell Price($)
0,69000,6,18000
1,35000,3,34000
2,57000,5,26100
3,22500,2,40000
4,46000,4,31500
5,59000,5,26750
6,52000,5,32000
7,72000,6,19300
8,91000,8,12000
9,67000,6,22000


In [4]:
x = df.drop(['Sell Price($)'],axis=1)
x.head()

Unnamed: 0,Mileage,Age(yrs)
0,69000,6
1,35000,3
2,57000,5
3,22500,2
4,46000,4


In [5]:
y = pd.DataFrame(df['Sell Price($)'])
y.head()

Unnamed: 0,Sell Price($)
0,18000
1,34000
2,26100
3,40000
4,31500


In [6]:
X_train,X_test,y_train,y_test = train_test_split(x,y,test_size=0.3,random_state=0)

# Approach :
 

* First, we define the variables x and y. In this example, they are read from a CSV file using pandas.


* Next, we add the constant b0 (the intercept) to the equation using the add_constant() function, because sm.OLS() does not add an intercept by itself.


* The OLS() class of the statsmodels.api module is used to perform OLS regression. It returns an OLS model object. The fit() method is then called on this object to fit the regression line to the data, and it returns the results object.


* The summary() method of the results object returns a table that describes the regression results in detail.
 

# Types of Linear Regression methods in statsmodels :-


* OLS : Fit a linear model using Ordinary Least Squares.


* WLS : Fit a linear model using Weighted Least Squares.


* GLS : Fit a linear model using Generalized Least Squares.

# Note :-


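* smf.ols() (lowercase, from statsmodels.formula.api) takes a formula string such as "y ~ x1 + x2" together with a DataFrame, and adds the intercept automatically.
* sm.OLS() (from statsmodels.api) takes the endog and exog data directly and needs add_constant() to include an intercept. Both interfaces work with any number of independent variables.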

# Ordinary Least Squares(OLS) method of linear regression 

In [7]:
X_train = sm.add_constant(X_train)

In [8]:
lm = sm.OLS(y_train,X_train).fit() # endog (y) comes first, then exog (X): the reverse of scikit-learn's fit(X, y)

In [9]:
lm.params  # estimated intercept (const) and coefficients of the features in X_train

const       46498.422248
Mileage        -0.296237
Age(yrs)     -630.044653
dtype: float64

In [10]:
lm.pvalues  # p-values of the intercept and of the features in X_train

const       7.134096e-11
Mileage     3.694625e-02
Age(yrs)    6.762326e-01
dtype: float64

In [11]:
print(lm.summary())

                            OLS Regression Results                            
Dep. Variable:          Sell Price($)   R-squared:                       0.913
Model:                            OLS   Adj. R-squared:                  0.897
Method:                 Least Squares   F-statistic:                     57.54
Date:                Sun, 19 Feb 2023   Prob (F-statistic):           1.49e-06
Time:                        16:32:48   Log-Likelihood:                -126.17
No. Observations:                  14   AIC:                             258.3
Df Residuals:                      11   BIC:                             260.3
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        4.65e+04   1928.513     24.111      0.0

# Note:- 


* R-squared is never smaller than Adj. R-squared, so this ordering by itself is not a sign of a well-fitted model.


* A large gap between the two suggests that some features add little explanatory power. In that case, reselect your features and build the model again.
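
The link between the two statistics can be checked by hand. The sketch below assumes the standard adjustment formula for a model with an intercept and plugs in the values printed in the summary above (R-squared 0.913, 14 observations, Df Model 2); the function name is made up for this example.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model that includes an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values taken from the OLS summary above.
adj = adjusted_r2(r2=0.913, n_obs=14, n_predictors=2)
print(round(adj, 3))  # 0.897, matching Adj. R-squared in the summary
```

Because the factor (n_obs - 1) / (n_obs - n_predictors - 1) is at least 1, the adjusted value can never exceed R-squared.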

In [12]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

model = lr.fit(X_train,y_train)

In [13]:
model.score(X_train,y_train)

0.9127533592803845

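For a regressor, score() returns R-squared on the given data, not an accuracy. The value here matches the R-squared of 0.913 in the OLS summary above.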

# Description of some of the terms in the table : 
 

* R-squared : the coefficient of determination. It is the proportion of the variance in the dependent variable that is explained by the independent variables.

    
* Adj. R-squared : R-squared adjusted for the number of independent variables in the model. It increases only when an extra variable improves the model more than would be expected by chance, and decreases otherwise.

    
* F-statistic : the ratio of the mean square explained by the model to the mean squared error of the residuals. It tests the overall significance of the model.

    
* coef : the coefficients of the independent variables and the constant term in the equation.

    
* t : the value of the t-statistic. It is the difference between the estimated and the hypothesized value of a parameter, divided by the standard error. In the summary table the hypothesized value is 0, so t = coef / std err.

# Logistic Regression using Statsmodels

# Building the Logistic Regression model :
Statsmodels is a Python module that provides functions for estimating many different statistical models and for performing statistical tests.

* First, we define the set of dependent (y) and independent (X) variables. If the dependent variable is non-numeric, it is first converted to numeric form, for example with dummy (0/1) coding.


* Statsmodels provides a Logit() class for performing logistic regression. Logit() accepts y and X as parameters and returns a Logit model object. The model is then fitted to the data with its fit() method.

# Note :-

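* When the target variable has 2 categories, we use sm.Logit()


* When the target variable has more than 2 categories, sm.MNLogit() (multinomial logit) is the matching model. sm.GLM() without a family argument, as used below, defaults to a Gaussian family with identity link, so it fits an ordinary linear model to the integer class codes and not a logistic regression.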

In [14]:
df = pd.read_csv("iris.csv")
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [15]:
df.Species.value_counts()

Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: Species, dtype: int64

In [16]:
df.replace({'Species':{'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2}},inplace=True)

In [17]:
df.drop('Id',inplace=True,axis=1)

In [18]:
df.head()

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [19]:
# defining the dependent and independent variables

x = df[['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']]
y = df[['Species']]

In [20]:
X_train,X_test,y_train,y_test = train_test_split(x,y,test_size=0.3,random_state=0)

In [21]:
X_train = sm.add_constant(X_train)

In [22]:
model = sm.GLM(y_train, X_train).fit()  # no family given: defaults to Gaussian with identity link (see the summary below)

In [23]:
print(model.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                Species   No. Observations:                  105
Model:                            GLM   Df Residuals:                      100
Model Family:                Gaussian   Df Model:                            4
Link Function:               identity   Scale:                        0.044435
Method:                          IRLS   Log-Likelihood:                 17.044
Date:                Sun, 19 Feb 2023   Deviance:                       4.4435
Time:                        16:32:49   Pearson chi2:                     4.44
No. Iterations:                     3   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust                                         
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
const             0.3523      0.226      1.561

In [24]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

model = lr.fit(X_train,y_train.values.ravel())  # scikit-learn expects a 1-D target array

In [25]:
model.score(X_train,y_train)

0.9809523809523809

For a classifier, score() returns the mean accuracy, which here is about 98% on the training data.