# **Introduction:**

* statsmodels is a Python library that helps you do statistics and statistical modeling.

* It’s like Excel for serious statistics, but in Python.

* Think of it like a tool that helps you answer questions like:

  > Does studying more hours IMPROVE test scores?

  > How does temperature AFFECT ice cream sales?

  > Can we PREDICT next month’s sales based on past months?

* It can do things like:

  1. Linear regression

  2. Logistic regression

  3. Time series analysis

  4. ANOVA (comparing groups)

  5. Many statistical tests

# **Installing:**

In [1]:
!pip install statsmodels



# **Interfaces inside Statsmodels:**

Statsmodels is the library. Inside it, there are two ways (interfaces) to use it:

1. **Formula API** (`statsmodels.formula.api`) → a convenient way to tell statsmodels what you want, using formulas like `Y ~ X1 + X2`. You write a math-like formula with column names from a DataFrame, and statsmodels pulls out the columns and adds the intercept for you.

2. **Data API** (explicit API, `statsmodels.api`) → a more manual way, where you pass the Y and X arrays/matrices yourself. More work (for example, you add the intercept column yourself with `sm.add_constant`), but more control.

So:

> Statsmodels = the tool/library

> Formula API and Data API = two ways (interfaces) to use that tool

Analogy:

* Statsmodels is like a car. 🚗

* Formula API = automatic transmission (easy, beginner-friendly)

* Data API = manual transmission (more control, but you have to do more yourself)

> The only real difference between the Formula API and the Data API: with the Data API we pull the columns out of the DataFrame ourselves and pass them in; with the Formula API we leave the columns in the DataFrame and just tell the model which column names to use, and it picks them up. We still have to spell the names out correctly, though.


"Automatic" is a relative term:

* The Formula API "automatically" picks columns, meaning we don't build separate X and Y arrays by hand. We still have to name the columns in the formula.

* If your column names are odd (like X, Y, A1…), you have to use exactly those names in the formula.

* Statsmodels knows which variable is dependent (Y) and which are independent (X) from their position in the formula: left of `~` is Y, right of `~` are the X's.



# **1️⃣ Formula API (like R):**

* You write a formula like in math:

          Y ~ X1 + X2

1. Y = the thing you want to predict (dependent variable)

2. X1, X2 = the things you think affect Y (independent variables)




## **Formula Explanation:**

This formula 'weight ~ height + age' means:

* weight → what you want to predict (dependent variable, Y)

* ~ → “is modeled as a function of”

* height + age → the things you think affect weight (independent variables, X’s)

      <what to predict> ~ <predict using this> + <also using this>

In simple words:

“Weight depends on height and age.”

Statsmodels will read it like:

weight = (some number) * height   **+**   (some number) * age **+ (intercept)**


> Example:

Imagine you want to guess how much a cake will weigh.

Ingredients: flour, sugar, eggs

Formula: cake_weight ~ flour + sugar + eggs

Statsmodels will figure out how much each ingredient affects the weight based on your past cakes.



In [2]:
import statsmodels.formula.api as smf
import pandas as pd

# sample data
data = pd.DataFrame({
    'height': [150, 160, 170, 180],
    'weight': [50, 60, 70, 80],
    'age': [20, 25, 30, 35]
})

# formula API: predict weight from height and age
model = smf.ols('weight ~ height + age', data=data).fit()
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:                 weight   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 2.713e+29
Date:                Tue, 02 Sep 2025   Prob (F-statistic):           3.69e-30
Time:                        11:51:25   Log-Likelihood:                 118.83
No. Observations:                   4   AIC:                            -233.7
Df Residuals:                       2   BIC:                            -234.9
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.0322   1.14e-16  -2.83e+14      0.0

  warn("omni_normtest is not valid with less than 8 observations; %i "


# **Understanding Results:**

* `weight` is the dependent/target variable. The model is Ordinary Least Squares (OLS).
* No. Observations is the number of instances (rows) used, which is 4. The formula asks for 3 coefficients (Intercept + height + age).
* Df = degrees of freedom.
* Residuals = the "errors", i.e. the differences between actual and predicted values.

## **1. Df Residuals:**

Basically, how many pieces of information are left over after fitting the model: the number of observations we have minus the number of coefficients we spent on the fit. We have 4 rows and asked for 3 coefficients, so we would expect Df Residuals = 4 - 3 = 1, meaning only 1 is left over.
One small clarification: the intercept also counts as a coefficient, which is why we count 3.

      Df Residuals = Number of observations - Number of coefficients

But our result says 2. The reason is the tiny dataset plus perfect collinearity: in this data age = 0.5 × height - 55 in every row, so height and age are exactly linearly dependent. Both rise together in lockstep and carry the same information. When two variables carry the same information, linear regression cannot say which one is responsible for the effect. If one had gone up while the other went down, or they had moved in some mixed way, the model could have told them apart and the result would have been 4 - 3 = 1. Instead statsmodels counts only the independent columns it can actually estimate, which is 2 here (1 intercept + 1 combined height-and-age direction), so Df Residuals = 4 - 2 = 2.

Basically: there’s barely any “freedom left” to check errors, which is a sign that the dataset is too small.
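
A minimal sketch of where the 2 comes from, assuming statsmodels derives the degrees of freedom from the rank of the design matrix (this assumption is consistent with the summary above). It uses only NumPy and the same four rows of data.

```python
import numpy as np

height = np.array([150, 160, 170, 180], dtype=float)
age = np.array([20, 25, 30, 35], dtype=float)

# Design matrix that 'weight ~ height + age' asks for:
# a column of ones (intercept), then height, then age
X = np.column_stack([np.ones_like(height), height, age])

n_obs, n_coef = X.shape
rank = np.linalg.matrix_rank(X)  # number of linearly independent columns

print("Requested coefficients :", n_coef)
print("Independent columns    :", rank)
print("Df Residuals = n - rank:", n_obs - rank)
print("Df Model = rank - 1    :", rank - 1)

# age is an exact linear function of height in this data
print(np.allclose(age, 0.5 * height - 55))
```

Three columns were requested but only two are independent, which reproduces Df Residuals = 2 and Df Model = 1.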



## **2. Df Model (Degrees of Freedom of Model)**

Df Model = how many predictors the model is actually using to explain Y (the intercept is not counted)

Normally, we count the number of predictors (height + age) = 2

Here statsmodels shows 1 because:

Height & age are perfectly collinear in this tiny dataset

Model can’t separate their effects → effectively only 1 independent direction explains weight

That’s why you see 1 instead of 2 → a symptom of multicollinearity.

### **Multicollinearity:**

* Multicollinearity = two or more independent variables (X’s) are so closely related to each other that the model cannot tell which one is driving the effect.

* Here, height and age follow exactly the same pattern.

* When the model tries to predict weight, it gets confused:

> “Is weight going up because of height, or because of age?”

**Effect of multicollinearity:**

* Coefficients become unstable → the individual numbers (b1, b2) lose a clear meaning

* Standard errors become huge → p-values unreliable

* Model can still predict Y, but it can’t tell you clearly which X is more important
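
To make the "model gets confused" point concrete, here is a small NumPy sketch on the same four rows. The two coefficient sets are hand-picked for illustration (they are not what statsmodels reports); the point is only that very different "explanations" fit the data equally well.

```python
import numpy as np

height = np.array([150, 160, 170, 180], dtype=float)
age = np.array([20, 25, 30, 35], dtype=float)
weight = np.array([50, 60, 70, 80], dtype=float)

# Correlation between the two predictors
corr = np.corrcoef(height, age)[0, 1]
print("corr(height, age):", corr)

# Two very different "explanations" of weight...
pred_height_only = -100 + 1.0 * height + 0.0 * age  # all credit to height
pred_age_only = 10 + 0.0 * height + 2.0 * age       # all credit to age

# ...that both reproduce the data exactly
print(np.allclose(pred_height_only, weight))
print(np.allclose(pred_age_only, weight))
```

Since both versions predict every row perfectly, the data alone cannot say whether height or age deserves the credit.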


# **3. Covariance Type: Nonrobust**

When the regression output says `cov_type = 'nonrobust'`, it means:

* “We are using the standard calculation and assuming there is nothing odd in the data (such as unequal error variance).”

* In other words, the reported standard errors and p-values are only valid if the errors have constant variance and are uncorrelated; in small samples the p-values also lean on the errors being roughly normal.

* If the data misbehaves (the error variance is not constant), use robust standard errors.

* By default statsmodels uses nonrobust. But if you want robust standard errors (to handle heteroscedasticity), you can set it manually. This changes the standard errors, not the coefficients:

      model = smf.ols('weight ~ height + age', data=data).fit(cov_type='HC3')

cov_type='HC3' → heteroscedasticity-robust standard errors (more reliable when the error variance is not constant)

> In simple analogy

> Default = “I assume your errors are nice & clean” → nonrobust

> Robust = “I don’t trust errors completely, let’s adjust standard errors”
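
A hedged sketch of the difference, on synthetic data invented for this example: the noise is deliberately made to grow with `x`, which is the "unequal errors" case. The names `plain` and `robust` are placeholders.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (an assumption for illustration): noise grows with x
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x)
df = pd.DataFrame({"x": x, "y": y})

plain = smf.ols("y ~ x", data=df).fit()                 # default: nonrobust
robust = smf.ols("y ~ x", data=df).fit(cov_type="HC3")  # robust std errors

print(plain.cov_type, plain.bse.to_dict())
print(robust.cov_type, robust.bse.to_dict())

# The coefficients are identical; only the standard errors differ
print(np.allclose(plain.params, robust.params))
```

Because only the standard errors move, the t-values, p-values and confidence intervals change while the fitted line stays the same.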



# **4. R-Square and Adjusted R-Square:**



* R² = how much of the variation in the target variable the model explains.

* Here weight follows height and age in a perfectly linear way.

* The model predicted every single point exactly → R² = 1.0

* Meaning: 100% of the variation is explained → the model accounts for every change in weight.

* Problem: if you add random, useless variables, R² never goes down and almost always creeps up a little, even when the variable is of no use.

* Example:

> You want to predict weight → use height and age → R² = 0.9
Now add an extra variable, say shoe size → R² automatically becomes 0.91
R² went up, but “shoe size” has no real effect on weight → this is misleading.


**Adjusted R² solution:**

* Adjusted R² adjusts R² for the number of predictors and the number of observations.

* Meaning: every extra variable pays a penalty, so adjusted R² only goes up if the new variable improves the fit by more than the penalty costs.

Formula:

      Adjusted R² = 1 - (1 - R²) * (n - 1)/(n - p - 1)

n = number of observations

p = number of predictors (not counting the intercept)

**Super simple analogy**

* Say you have lego blocks, and the weight is determined by height + age.

* R² checks → “How much of the weight did the model explain?”

* If you add an extra useless block → R² goes up a little anyway.

* Adjusted R² → “Charge a price for every block, so a useless block no longer earns credit.”

**One line version**

* Adjusted R² = a smarter version of R² that penalizes extra variables, so useless ones stop inflating the score.
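
The formula is easy to try out by hand. The sketch below plugs in the R² values from the shoe-size example (0.90 and 0.91); the sample size `n = 10` is a hypothetical choice, since the example above does not state one.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R² for n observations and p predictors (intercept excluded)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n = 10  # hypothetical number of rows

before = adjusted_r2(0.90, n, p=2)  # height + age
after = adjusted_r2(0.91, n, p=3)   # height + age + shoe size

print(f"R²: 0.90 -> 0.91, adjusted R²: {before:.3f} -> {after:.3f}")
```

With these numbers R² rises from 0.90 to 0.91, but adjusted R² falls from about 0.871 to 0.865, which flags shoe size as not worth its cost.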



# **5. F-statistics and Prob(F-statistics):**

* F-statistic → checks how well the model fits compared to a baseline model that only predicts the average → “Is your model better than just guessing the mean, or not?”

* Bigger F-statistic = stronger evidence that the predictors matter

* Prob (F-statistic) = the p-value of that test: the probability of getting an F-statistic at least this large if none of the predictors had any real effect

* Small number (like 0.0001) → a result like this would be very unlikely if the model were no better than the baseline → the model really explains something

Prob (F-statistic) very small → the fit is real, not just a product of chance
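
For an OLS model with an intercept, the overall F-statistic can be written in terms of R². A minimal sketch follows; the inputs (R² of 0.90 and 0.10, 10 rows, 2 predictors) are hypothetical values chosen only to show the contrast.

```python
def f_statistic(r2: float, n: int, p: int) -> float:
    """Overall F-statistic of an OLS model with an intercept and p predictors."""
    explained_per_predictor = r2 / p
    unexplained_per_df = (1 - r2) / (n - p - 1)
    return explained_per_predictor / unexplained_per_df

strong = f_statistic(0.90, n=10, p=2)  # model explains most of the variation
weak = f_statistic(0.10, n=10, p=2)    # model explains very little

print(f"F (R² = 0.90): {strong:.2f}")
print(f"F (R² = 0.10): {weak:.2f}")
```

The strong model lands around 31.5 and the weak one below 1. After fitting with statsmodels, the same quantities are available on the result object as `fvalue` and `f_pvalue`.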



# **6. Log Likelihood, AIC, BIC**

* Log-Likelihood → “how well did your guesses match the real weights?” More precisely: how probable the observed data is under the fitted model.
> Better guesses → bigger Log-Likelihood

> Bad guesses → smaller Log-Likelihood

**AIC (Akaike Information Criterion):**

* What it is: A number that balances model fit and model complexity

      AIC = 2*(number of parameters) - 2*log-likelihood

* Smaller AIC → better model

**Analogy:**

* You can have a simple model (2 ingredients) or a complex model (10 ingredients).

* Complex model might fit perfectly, but too many ingredients = bad idea

* AIC = “score that punishes too many ingredients”

* Goal: smallest score → best balance between fit & simplicity

**BIC (Bayesian Information Criterion)**

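* Similar to AIC, but punishes complexity more strictly once the dataset has 8 or more rows (because then log(n) > 2), and the gap grows as the dataset gets larger. With only 4 rows here, log(4) ≈ 1.39 < 2, which is why BIC (-234.9) comes out slightly below AIC (-233.7) in our output.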

Formula (conceptually):

      BIC = log(n)*(number of parameters) - 2*log-likelihood

Analogy:

Like AIC, but your teacher is stricter about adding too many ingredients

So BIC encourages even simpler models

| Term           | What it means                                   | Good/bad?        |
| -------------- | ----------------------------------------------- | ---------------- |
| Log-Likelihood | How probable the data is under the fitted model | Bigger = better  |
| AIC            | Balance fit vs complexity                       | Smaller = better |
| BIC            | Stricter balance fit vs complexity (for n ≥ 8)  | Smaller = better |

✅ Quick tip:

Log-Likelihood → tells how well model fits

AIC/BIC → tells fit + simplicity, helps choose best model
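
As a sanity check, the two formulas can be applied to the numbers printed in the summary above. One assumption here: the parameter count is taken as 2 (Df Model = 1 plus the intercept), which is what makes the arithmetic line up with the reported values.

```python
import math

log_likelihood = 118.83  # "Log-Likelihood" in the summary above
n_obs = 4                # "No. Observations"
k_params = 2             # assumption: Df Model (1) + intercept (1)

aic = 2 * k_params - 2 * log_likelihood
bic = math.log(n_obs) * k_params - 2 * log_likelihood

print(f"AIC = {aic:.1f}")  # summary reports -233.7
print(f"BIC = {bic:.1f}")  # summary reports -234.9
```

Note that `math.log` is the natural logarithm, which is the one the BIC formula uses.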




# **7. coef, std err, t, P>|t|,[0.025 0.975]**

* **coef** is the number that tells you the effect of each independent variable on the target variable. E.g. for every 1 unit increase in height, weight goes up by 0.0912 units (holding age fixed); for every 1 unit increase in age, weight goes up by 1.8176 units (holding height fixed). The intercept (-0.0322) is not an "effect": it is the predicted weight when height and age are both 0. Because height and age are perfectly collinear here, this split between the two is not unique, so don't read much into the individual numbers.

weight = -0.032 + 0.091*height + 1.818*age

* **std err:** a measure of the uncertainty in each coefficient. Tiny numbers (1e-16 etc.) mean the data fits perfectly, with almost no error.

* **t** is the t-test statistic. It checks:
👉 "Is this coefficient actually different from zero?"
  > If the coefficient were 0, it would mean “height/age has no effect on weight at all.”

  > If the coefficient ≠ 0, it means “yes, it has some effect.”

        Formula: t = coefficient ÷ standard error

  > Big |t| → the estimate is far from zero relative to its uncertainty (strong evidence of an effect).

  > Small |t| → the effect may well be zero.

* **P>|t| (p-value):**

This is a probability.

👉 “If the true coefficient were zero, how likely would it be to see a t-value at least this extreme just from random noise?”

Small p-value (say < 0.05) → such a result would be rare if the effect were zero → evidence that the coefficient reflects a real effect.

Large p-value (say 0.5) → the data is consistent with no effect → no evidence that the variable matters.

**Example:**

Imagine you run an experiment for a science fair:

You observed that “drinking milk increases height.”

Coefficient = +2 cm (this is the effect).

T-test = checking “is this really because of the milk, or just coincidence?”

P-value = how likely a difference this big would be from coincidence alone, if milk did nothing.

If the p-value is small → coincidence is an unlikely explanation, so milk probably does have an effect.
If the p-value is large → the data gives no evidence that milk matters; the taller kids may just have been taller by chance.
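
The milk example can be turned into numbers with a short sketch. Only the +2 cm coefficient comes from the example; the standard error (0.8) and the residual degrees of freedom (28) are hypothetical values added here so that the arithmetic can run. It uses `scipy.stats`, which is installed alongside statsmodels.

```python
from scipy import stats

coef = 2.0      # estimated effect from the example: +2 cm
std_err = 0.8   # hypothetical standard error
df_resid = 28   # hypothetical residual degrees of freedom

# t = coefficient / standard error
t_value = coef / std_err

# Two-sided p-value: chance of |t| at least this large if the true effect is 0
p_value = 2 * stats.t.sf(abs(t_value), df_resid)

print(f"t = {t_value:.2f}, p = {p_value:.4f}")
```

Under these assumed numbers t is 2.5 and the p-value falls below 0.05, so the milk effect would count as statistically significant at the usual threshold.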



# **8. Omnibus, Prob(Omnibus), Skew, Kurtosis, Durbin Watson, Jarque-Bera (JB), Prob(JB), Cond. No.**

Now we have reached the bottom section of the regression results.
These are mostly diagnostic tests, like a health checkup for the model.

## **A. Omnibus & Prob(Omnibus)**

* This is a test: "Do the model's residuals (errors) look like they come from a normal distribution?"

* “Residuals” = the difference between the actual values and the values the model predicted.

* If the residuals are normal → a key model assumption holds.

* It shows NaN here because the dataset is very small (4 rows); statsmodels warns that this test is not valid with fewer than 8 observations.

* With a bigger dataset, Prob(Omnibus) < 0.05 would mean the residuals are not normal (a bad sign).


## **B. Jarque-Bera (JB) & Prob(JB)**

* This is another test of whether the residuals are normal (Omnibus's sibling 😅).

* The JB test is based on skewness and kurtosis.

* Prob(JB) = the p-value of the test: how likely skew and kurtosis this far from normal would be if the residuals really were normal.

* Here Prob(JB) = 0.618 (well above 0.05) → no evidence against normality (good), though with only 4 rows the test has very little power.

## **C. Skew**

* How much the residuals lean to the left or right.

* Skew = 0 → perfectly symmetric.

* Negative skew (like -1.155) → the residuals have a longer tail on the left.

* A little skew is fine; a lot of it is a problem.

## **D. Kurtosis**

* This describes how heavy the tails of the distribution are (often loosely called “peakedness”) compared to a normal distribution.

* For a normal distribution, kurtosis is 3.

* In this result it is 2.333 → slightly lighter tails, a bit flat (but near 3, so fine).
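
Skew and kurtosis are exactly the two ingredients of the Jarque-Bera test, so the Prob(JB) quoted above can be rebuilt from them. This is a sketch using the standard JB formula and the values quoted in these notes (n = 4, skew = -1.155, kurtosis = 2.333); the chi-square p-value uses the closed form that holds for 2 degrees of freedom.

```python
import math

n_obs = 4
skew = -1.155
kurtosis = 2.333

# Jarque-Bera statistic: penalises skew away from 0 and kurtosis away from 3
jb = n_obs / 6 * (skew**2 + (kurtosis - 3) ** 2 / 4)

# Under normality JB is approximately chi-square with 2 degrees of freedom,
# whose upper-tail probability has the closed form exp(-x / 2)
prob_jb = math.exp(-jb / 2)

print(f"JB = {jb:.3f}, Prob(JB) = {prob_jb:.3f}")
```

The result agrees with the Prob(JB) = 0.618 from the summary, up to rounding of the displayed skew and kurtosis.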


## **E. Durbin-Watson**

* This tests: “Are the residuals related (correlated) to each other from one row to the next?”

* Range = 0 to 4.

 > ~2 → ideal (no autocorrelation).

 > <1 → strong positive autocorrelation (neighbouring residuals move together).

 > >3 → strong negative autocorrelation.

In this result it is 0.027 → very strong positive autocorrelation 😅 (a warning sign, mostly because the data is tiny and perfectly collinear).
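
The statistic itself is simple: the sum of squared differences between neighbouring residuals, divided by the sum of squared residuals. Below is a minimal NumPy sketch with two made-up residual patterns; the helper name `durbin_watson_stat` is hypothetical (statsmodels ships its own version as `statsmodels.stats.stattools.durbin_watson`).

```python
import numpy as np

def durbin_watson_stat(residuals) -> float:
    """Sum of squared successive differences over sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

# Made-up residual patterns for illustration
smooth = [-3, -2, -1, 1, 2, 3]   # neighbours are similar: positive autocorrelation
zigzag = [1, -1, 1, -1, 1, -1]   # neighbours flip sign: negative autocorrelation

dw_smooth = durbin_watson_stat(smooth)
dw_zigzag = durbin_watson_stat(zigzag)

print(f"smooth: {dw_smooth:.3f}")
print(f"zigzag: {dw_zigzag:.3f}")
```

The smooth pattern lands well below 2 (about 0.29) and the zigzag pattern well above it (about 3.33), matching the reading guide above.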

## **F. Cond. No. (Condition Number)**

* This detects multicollinearity (predictors that are almost copies of each other).

* If Cond. No. is very large, suspect multicollinearity. A common rule of thumb treats values above about 30 as worth a look, and statsmodels prints a warning in the summary once it goes above 1000.

* In this result it is 1.74e+19 (really huge) → the predictors (height & age) carry the same information and are linearly dependent → this is why the collinearity warning appeared.
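
A rough NumPy sketch of the idea, comparing the design matrix from these notes with a hypothetical version in which the same age values are reordered so they no longer track height. It uses `np.linalg.cond`; the exact figure for the collinear case need not match the 1.74e+19 in the summary, because for a singular matrix the last digits are decided by floating-point rounding.

```python
import numpy as np

height = np.array([150, 160, 170, 180], dtype=float)
age_collinear = np.array([20, 25, 30, 35], dtype=float)  # data from these notes
age_shuffled = np.array([30, 20, 35, 25], dtype=float)   # hypothetical reordering

def design_matrix(height, age):
    # Intercept column of ones, then the two predictors
    return np.column_stack([np.ones_like(height), height, age])

cond_collinear = np.linalg.cond(design_matrix(height, age_collinear))
cond_shuffled = np.linalg.cond(design_matrix(height, age_shuffled))

print(f"collinear age: {cond_collinear:.3g}")
print(f"shuffled age : {cond_shuffled:.3g}")
```

The collinear version should come out astronomically large (or infinite), while the shuffled version stays many orders of magnitude smaller.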


## ✅ Super Simple Recap:

Omnibus & JB = normality tests.

Skew & Kurtosis = shape of the residuals.

Durbin-Watson = autocorrelation check.

Cond. No. = multicollinearity check.



# **9. Notes:**

👉 One-line summary:
The dataset is very small and the predictors are perfectly collinear, so the model's results are not reliable.

# **--------------------------- Appendices:---------------------------**

## **1. P-VALUE:**

A p-value is a number that helps you judge whether your results are statistically significant or whether the result you observed might just be a coincidence.

**What it measures:**

* It measures the probability of observing your data, or something more extreme, if the null hypothesis is true.

* The null hypothesis usually says: “There is no effect” or “There is no difference.” This is the “default” idea that nothing special is happening.

* “If the null hypothesis were true (nothing special is happening), how likely is it to see the results I got—or results even more unusual than this?”

**How to interpret it:**

* Small p-value (usually ≤ 0.05): Strong evidence against the null hypothesis → you might reject the null. In short: there probably is an effect.

* Large p-value (usually > 0.05): Weak evidence against the null → you fail to reject the null. In short: no convincing evidence of an effect.

**Important:**

* A small p-value does not prove your own hypothesis is true; it just suggests your data is not likely under the null hypothesis.

* A large p-value does not prove the null hypothesis is true; it just means your data isn’t strong enough to reject it.

**Example:**

Suppose you test if a new drug lowers blood pressure.

Null hypothesis: “The drug has no effect.”

You get p = 0.03 → if the drug truly had no effect, there would be only a 3% chance of seeing a result at least this extreme, so you might conclude the drug is effective.
