<a href="https://colab.research.google.com/github/JapiKredi/StatsModels_basics2/blob/main/158StatsModels_230322_061425.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---

<center><h1> 📍 📍 StatsModels 📍 📍</h1></center>

---
Statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

---

In [None]:
# import the statsmodels library
import statsmodels

***If you got an error while running the above cell, install the library with one of the following commands.***

If you are using anaconda with python3: ***`!pip install statsmodels`***

If you are using jupyter with python3: ***`!pip3 install statsmodels`***
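
Before reinstalling, it can help to check whether the package is visible to the interpreter that the notebook kernel is actually running. The snippet below is a minimal sketch using only the standard library; the helper name `is_installed` is made up for this example.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the current interpreter can locate `package_name`."""
    return importlib.util.find_spec(package_name) is not None


# If this prints False, run the install command in a cell and restart the kernel.
print(is_installed("statsmodels"))
```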

---

In [None]:
# check the version
statsmodels.__version__

'0.14.4'

---

#### `IMPORTING THE REQUIRED LIBRARIES`

---

In [None]:
# statsmodels
import statsmodels.api as sm
from sklearn import datasets
import numpy as np
import pandas as pd

---

#### `LOADING THE DATASET`

---

In [None]:
## attempts to load the Boston dataset; load_boston was removed in scikit-learn 1.2, so this cell raises ImportError
data = datasets.load_boston()

ImportError: 
`load_boston` has been removed from scikit-learn since version 1.2.

The Boston housing prices dataset has an ethical problem: as
investigated in [1], the authors of this dataset engineered a
non-invertible variable "B" assuming that racial self-segregation had a
positive impact on house prices [2]. Furthermore the goal of the
research that led to the creation of this dataset was to study the
impact of air quality but it did not give adequate demonstration of the
validity of this assumption.

The scikit-learn maintainers therefore strongly discourage the use of
this dataset unless the purpose of the code is to study and educate
about ethical issues in data science and machine learning.

In this special case, you can fetch the dataset from the original
source::

    import pandas as pd
    import numpy as np

    data_url = "http://lib.stat.cmu.edu/datasets/boston"
    raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
    data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
    target = raw_df.values[1::2, 2]

Alternative datasets include the California housing dataset and the
Ames housing dataset. You can load the datasets as follows::

    from sklearn.datasets import fetch_california_housing
    housing = fetch_california_housing()

for the California housing dataset and::

    from sklearn.datasets import fetch_openml
    housing = fetch_openml(name="house_prices", as_frame=True)

for the Ames housing dataset.

[1] M Carlisle.
"Racist data destruction?"
<https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8>

[2] Harrison Jr, David, and Daniel L. Rubinfeld.
"Hedonic housing prices and the demand for clean air."
Journal of environmental economics and management 5.1 (1978): 81-102.
<https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air>


`load_boston()` has been removed from scikit-learn (since version 1.2), so the rest of this notebook uses `fetch_california_housing()` instead.

In [None]:
## loads california_housing dataset from datasets library
data = datasets.fetch_california_housing()

---

#### `CREATE THE DATAFRAME`

---

In [None]:
# create a dataframe
features = pd.DataFrame(data.data, columns= data.feature_names)

In [None]:
features.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |
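
The manual DataFrame construction above can also be delegated to scikit-learn: its dataset loaders accept `as_frame=True` (scikit-learn 0.23 and later) and then return pandas objects directly. The sketch below uses `load_diabetes` as a stand-in, because that dataset ships with scikit-learn and needs no download; the assumption is that `fetch_california_housing(as_frame=True)` behaves the same way once the data has been fetched.

```python
from sklearn import datasets

# load_diabetes is bundled with scikit-learn, so no download is needed;
# it stands in here for fetch_california_housing, which takes the same flag
bunch = datasets.load_diabetes(as_frame=True)

features_demo = bunch.data    # pandas DataFrame of feature columns
target_demo = bunch.target    # pandas Series holding the target
combined = bunch.frame        # features and target in one DataFrame

print(features_demo.head())
print(combined.shape)
```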


---

#### `SEPARATE THE TARGET VARIABLE`

---

In [None]:
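# target in another DataFrame; the column keeps the Boston-era name "MEDV",
# but here it holds the California median house value (in units of $100,000)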
target = pd.DataFrame(data.target, columns=["MEDV"])

---

#### `CREATE A LINEAR REGRESSION MODEL`

---

In [None]:
# define X (features) and y (target)
X = features
y = target["MEDV"]
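
As an alternative to splitting the data into `X` and `y` by hand, statsmodels also offers a formula interface that reads column names straight from a single DataFrame. The sketch below runs on synthetic stand-in data rather than the housing dataset; the column names `x1`, `x2` and `y` are made up for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (not the housing dataset): y = 2 + 3*x1 - 1*x2 + noise
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=500), "x2": rng.normal(size=500)})
df["y"] = 2.0 + 3.0 * df["x1"] - 1.0 * df["x2"] + rng.normal(scale=0.1, size=500)

# The formula names the response on the left and the predictors on the right.
# The formula interface includes an intercept term automatically.
formula_model = smf.ols("y ~ x1 + x2", data=df).fit()
print(formula_model.params)
```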

---

#### `BUILD A MODEL`


---

In [None]:
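# Note the argument order: sm.OLS takes the target first (endog), then the features (exog),
# the reverse of scikit-learn's fit(X, y).
# sm.OLS does not add an intercept on its own; X has no constant column here,
# so this model is fit through the origin.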
model = sm.OLS(y, X).fit()

---

#### `MAKE PREDICTIONS`

---

In [None]:
predictions = model.predict(X) # make the predictions by the model

---

#### `SUMMARY OF THE MODEL`

---

In [None]:
# Print out the statistics
model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | MEDV | R-squared (uncentered): | 0.892 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.892 |
| Method: | Least Squares | F-statistic: | 21370.0 |
| Date: | Fri, 27 Dec 2024 | Prob (F-statistic): | 0.0 |
| Time: | 06:23:47 | Log-Likelihood: | -24087.0 |
| No. Observations: | 20640 | AIC: | 48190.0 |
| Df Residuals: | 20632 | BIC: | 48250.0 |
| Df Model: | 8 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| MedInc | 0.5135 | 0.004 | 120.594 | 0.000 | 0.505 | 0.522 |
| HouseAge | 0.0157 | 0.000 | 33.727 | 0.000 | 0.015 | 0.017 |
| AveRooms | -0.1825 | 0.006 | -29.673 | 0.000 | -0.195 | -0.170 |
| AveBedrms | 0.8651 | 0.030 | 28.927 | 0.000 | 0.806 | 0.924 |
| Population | 7.792e-06 | 5.09e-06 | 1.530 | 0.126 | -2.19e-06 | 1.78e-05 |
| AveOccup | -0.0047 | 0.001 | -8.987 | 0.000 | -0.006 | -0.004 |
| Latitude | -0.0639 | 0.004 | -17.826 | 0.000 | -0.071 | -0.057 |
| Longitude | -0.0164 | 0.001 | -14.381 | 0.000 | -0.019 | -0.014 |

| | | | |
|---|---|---|---|
| Omnibus: | 4353.392 | Durbin-Watson: | 0.909 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 14087.489 |
| Skew: | 1.069 | Prob(JB): | 0.0 |
| Kurtosis: | 6.436 | Cond. No. | 10300.0 |


---

#### `STATSMODELS TOOLS`

- REGRESSION AND LINEAR MODELS
- TIME SERIES ANALYSIS
- STATISTICS AND TOOLS

---


***Learn more about statsmodels here: https://www.statsmodels.org/stable/gettingstarted.html***
    
    
---    