# **Logistic Regression in StatsModels using Formula API:**

* Linear regression → use it when you need to predict a number (e.g. weight).

* Logistic regression → use it when you need to predict Yes/No or 0/1 (e.g. whether a patient has a disease, or whether a student passes or fails).


In [None]:
import statsmodels.formula.api as smf
import pandas as pd

data = pd.DataFrame({
    'age': [20, 25, 30, 35, 40, 45, 28, 33],
    'smoker': [0, 0, 0, 1, 1, 1, 1, 0]   # no longer perfectly separated
})

model = smf.logit('smoker ~ age', data=data).fit()
print(model.summary())


Optimization terminated successfully.
         Current function value: 0.417667
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:                 smoker   No. Observations:                    8
Model:                          Logit   Df Residuals:                        6
Method:                           MLE   Df Model:                            1
Date:                Tue, 02 Sep 2025   Pseudo R-squ.:                  0.3974
Time:                        17:47:42   Log-Likelihood:                -3.3413
converged:                       True   LL-Null:                       -5.5452
Covariance Type:            nonrobust   LLR p-value:                   0.03578
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -9.4853      6.708     -1.414      0.157     -22.632       3.662
age            0.2992      0.

Predictions:

In [None]:
print(model.predict(pd.DataFrame({'age':[28, 50]})))

0    0.248232
1    0.995823
dtype: float64


This tells us:

A 28-year-old has about a 25% chance of being a smoker.

A 50-year-old has about a 99.6% chance of being a smoker.
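
To see where these probabilities come from, the sketch below applies the logistic function by hand to the rounded coefficients printed in the summary (Intercept -9.4853, age 0.2992). The helper name `predict_prob` is made up for illustration. Because the coefficients are rounded, the results agree with `model.predict` only to roughly three decimal places.

```python
import math

# Rounded coefficients copied from the summary table above.
INTERCEPT = -9.4853
AGE_COEF = 0.2992

def predict_prob(age):
    """Return P(smoker = 1) for a given age using the logistic function."""
    z = INTERCEPT + AGE_COEF * age       # linear predictor (log-odds)
    return 1.0 / (1.0 + math.exp(-z))    # squash the log-odds into (0, 1)

for age in (28, 50):
    print(age, round(predict_prob(age), 4))
```

The model is linear on the log-odds scale; the probability is a transformation of that linear value, which is why the output always stays between 0 and 1.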

## **Issues in Logistic Regression:**

Logistic regression follows a simple rule:

 > If the data is 100% perfectly separated (as in your earlier case, where age alone decided smoker), maximum likelihood has no finite solution, so the fit fails. Depending on the statsmodels version you get a `PerfectSeparationError` or a `PerfectSeparationWarning`, often together with a `Singular matrix` error. This is a mathematical failure that occurs when the equations cannot be solved.
In your situation it happened because the predictor (age) split the outcome (smoker) perfectly into 0 and 1.

 > If the data has some overlap (a few young smokers, a few older non-smokers), the model fits and produces a proper summary.
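
With a single numeric predictor, a quick way to check for this situation is to compare the ranges of the two classes. The sketch below is plain Python. `is_perfectly_separated` is a hypothetical helper, not a statsmodels function, and the six-row data set is an assumed reconstruction of the earlier failing case (the first six rows of the data above).

```python
def is_perfectly_separated(x, y):
    """True if a single threshold on x splits the 0s from the 1s exactly."""
    zeros = [xi for xi, yi in zip(x, y) if yi == 0]
    ones = [xi for xi, yi in zip(x, y) if yi == 1]
    if not zeros or not ones:
        return True  # only one class present, so there is nothing to separate
    # Separated when the two classes occupy non-overlapping ranges of x.
    return max(zeros) < min(ones) or max(ones) < min(zeros)

# Assumed earlier data: every smoker is older than every non-smoker.
age_sep = [20, 25, 30, 35, 40, 45]
smoker_sep = [0, 0, 0, 1, 1, 1]

# Data from the cell above: a 28-year-old smoker and a 33-year-old non-smoker overlap.
age_ok = [20, 25, 30, 35, 40, 45, 28, 33]
smoker_ok = [0, 0, 0, 1, 1, 1, 1, 0]

print(is_perfectly_separated(age_sep, smoker_sep))  # True
print(is_perfectly_separated(age_ok, smoker_ok))    # False
```

This check only covers one predictor. With several predictors, separation can occur along a combination of columns, so a fit warning is still the more reliable signal.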


# **Multiple Regression in StatsModels using Formula API:**

This is really the same OLS (linear regression), just with more than one predictor:

        model = smf.ols("y ~ x1 + x2 + x3", data=df).fit()
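
As a minimal sketch of the call above, the example below builds a made-up, noise-free data set with known coefficients and checks that OLS recovers them. The column names `x1`, `x2`, `x3` and `y` follow the formula; the numbers are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(size=50),
    "x2": rng.normal(size=50),
    "x3": rng.normal(size=50),
})
# Synthetic outcome with known coefficients and no noise.
df["y"] = 1.0 + 2.0 * df["x1"] - 3.0 * df["x2"] + 0.5 * df["x3"]

model = smf.ols("y ~ x1 + x2 + x3", data=df).fit()
print(model.params)  # Intercept, x1, x2, x3 in formula order
```

Each coefficient is the change in `y` for a one-unit change in that predictor while the other predictors are held fixed.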




| Problem type        | Formula API possible?   | Example                   |
| ------------------- | ----------------------- | ------------------------- |
| Linear regression   | ✅                       | weight \~ height + age    |
| Multiple regression | ✅                       | y \~ x1 + x2 + x3         |
| Logistic regression | ✅                       | outcome \~ age + gender   |
| Poisson/GLM         | ✅                       | counts \~ x1 + x2         |
| ANOVA               | ✅ (via ols)             | score \~ C(group)         |
| Mixed effects       | ✅ (via mixedlm)         | y \~ x, groups=group      |
| Time series         | ❌ (Data API)            | ARIMA, SARIMAX            |
| t-test / chi-square | ❌ (use plain functions) | ttest\_ind(), chisquare() |
| PCA / clustering    | ❌                       | sm or sklearn functions   |


🔹 Step 1: Identify the type of problem

**Regression type problem?**

Predicting a numeric outcome (weight, salary, temperature) → linear regression

Predicting a categorical outcome (0/1, yes/no, class type) → logistic regression; predicting counts → Poisson regression / GLM
✅ All of these are easy in the formula API: 'y ~ x1 + x2'

**Time series problem?**

Predict future values based on past values → ARIMA, SARIMAX, ETS
❌ No formula API here; you pass the series directly through the data API

**Hypothesis test / basic stats?**

t-test, chi-square, correlation → no formula API; you call plain functions

      sm.stats.ttest_ind(x1, x2)
      scipy.stats.chisquare(observed)   # goodness-of-fit test from SciPy


**Mixed effects / grouped data?**

If the data has groups (schools, patients, repeated measures) → smf.mixedlm
✅ Possible with the formula API
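
The sketch below shows one way a random-intercept model can be written with `smf.mixedlm`. The data is simulated, and the column names `y`, `x` and `school` are assumptions for illustration. Note that the grouping column is passed through the `groups` argument rather than inside the formula string.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_groups, n_per_group = 20, 15
group = np.repeat(np.arange(n_groups), n_per_group)

# Each school gets its own random shift of the intercept.
group_shift = rng.normal(scale=1.0, size=n_groups)[group]
x = rng.normal(size=group.size)
y = 3.0 + 2.0 * x + group_shift + rng.normal(scale=0.5, size=group.size)
df = pd.DataFrame({"y": y, "x": x, "school": group})

# Fixed effect for x, random intercept for each school.
result = smf.mixedlm("y ~ x", data=df, groups=df["school"]).fit()
print(result.fe_params)  # fixed-effect estimates: Intercept and x
```

The fixed-effect slope for `x` should land near the simulated value of 2.0, while the spread between schools is absorbed by the group variance term.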

# **----------------------- Data API -----------------------**



* You extract Y (the dependent variable) and X (the independent variables) yourself and pass them to the model as arrays/Series.

Difference:

* Formula API:

```
import statsmodels.formula.api as smf

model = smf.ols('weight ~ height + age', data=df).fit()

```

* Data API:

```
import statsmodels.api as sm

X = df[['height', 'age']]   # independent variables
X = sm.add_constant(X)      # the intercept has to be added manually
y = df['weight']            # dependent variable

model = sm.OLS(y, X).fit()

```



# **3️⃣ When do you use the Data API?**

✅ When:

* You don't need to write a formula such as 'weight ~ height + age' (you have already built the arrays directly).

* You want full control (adding the intercept, building the design matrix, generating lagged variables).

* You fit time series models (ARIMA, SARIMA, VAR), because these rarely use the formula API and mostly go through the Data API.

# **4️⃣ Important Models with Data API**

🔸 Linear Regression

      model = sm.OLS(y, X).fit()

🔸 Logistic Regression

      model = sm.Logit(y, X).fit()

🔸 Time Series

ARIMA:

      model = sm.tsa.ARIMA(series, order=(p,d,q)).fit()


SARIMA:

      model = sm.tsa.SARIMAX(series, order=(p,d,q), seasonal_order=(P,D,Q,s)).fit()


VAR (Vector AutoRegression):

      model = sm.tsa.VAR(df).fit()


# **5️⃣ Things the Formula API does automatically but the Data API makes you do manually**

* Intercept (constant column) → the Formula API adds it automatically; with the Data API you have to write sm.add_constant(X).

* Column reference by name → the Formula API works from the names written in the string; with the Data API you have to select df[['col1','col2']] yourself.

* Categorical encoding → the Formula API automatically turns categorical variables into dummies; with the Data API you have to call pd.get_dummies() yourself.

# **🔹 What is categorical encoding?**

Machine learning / regression models understand numbers, not words.

Suppose your data has a column like this:

| Name | Color | Weight |
| ---- | ----- | ------ |
| A    | Red   | 10     |
| B    | Blue  | 15     |
| C    | Green | 20     |

Color is a categorical variable (Red, Blue, Green).
The model cannot use it directly, because these values are not numbers.

🔹 What happens in the Formula API?

The Formula API automatically turns categories into dummy variables.

Example:

```
import pandas as pd
import statsmodels.formula.api as smf

data = pd.DataFrame({
    'Weight': [10, 15, 20],
    'Color': ['Red', 'Blue', 'Green']
})

model = smf.ols('Weight ~ Color', data=data).fit()
print(model.summary())

```
👉 Here Color is converted automatically into:

Color[T.Green]

Color[T.Red]

(one category becomes the baseline, here Blue because it comes first alphabetically, and the others become 0/1 columns)

🔹 What do you have to do in the Data API?

The Data API does no automatic conversion. You have to call pd.get_dummies() yourself.

```
import statsmodels.api as sm

X = pd.get_dummies(data['Color'], drop_first=True, dtype=float)  # manual dummies; floats, since newer pandas returns booleans by default
X = sm.add_constant(X)  
y = data['Weight']

model = sm.OLS(y, X).fit()
print(model.summary())

```



# **Why the Data API for time series?**

Time series models (ARIMA, SARIMA) do not rely on the formula API.

There you usually pass a single series (e.g. sales over time).

Example:

      import statsmodels.api as sm

      model = sm.tsa.ARIMA(series, order=(1,1,1)).fit()


👉 Here you are not writing a formula such as "sales ~ lagged_sales".
The model handles the time lags automatically.

So, for time series, use the Data API.

In short:

Cross-sectional regression (OLS, logistic regression) → the Formula API is more convenient.

Time series analysis (ARIMA, SARIMA, VAR, seasonal trends) → the Data API is essential.

### **🔹 1. What is cross-sectional data?**

You observe many different people or things at a single point in time.

Example:

Measuring the height, weight and age of 100 students in the same year.

Recording the sales of 50 shops on one day.

👉 In other words: “one point in time, many subjects”

### **🔹 2. What is time series data?**

You observe the same thing at different times.

Example:

The daily sales record of 1 shop.

Pakistan's GDP recorded every year.

👉 In other words: “one subject, values repeated over time”

### **🔹 3. Cross-sectional regression**

This is when you run a regression on cross-sectional data.

Example:

Weight ~ Height + Age


You have data on different people at one point in time and predict weight from their height and age.

### **🔹 4. Time series regression / models**

This is when you take data on one thing over time and study its pattern.

Example:

Sales_t ~ Sales_(t-1) + Sales_(t-2)


Predicting today's sales from yesterday's sales and the sales of the day before.

This is where ARIMA and SARIMA come in.
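
The lagged relationship above can also be built by hand, which makes clear what the time series models automate. The sketch below simulates a made-up sales series, creates the two lag columns with `shift`, and fits a plain OLS. This is a hand-rolled autoregression for illustration, not ARIMA itself, which adds differencing and moving-average terms on top of the same idea.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a made-up AR(2) sales series: today depends on the last two days.
rng = np.random.default_rng(7)
n = 2000
sales = np.zeros(n)
sales[:2] = 25.0
for t in range(2, n):
    sales[t] = 5.0 + 0.5 * sales[t - 1] + 0.3 * sales[t - 2] + rng.normal()

df = pd.DataFrame({"sales": sales})
df["lag1"] = df["sales"].shift(1)   # Sales_(t-1)
df["lag2"] = df["sales"].shift(2)   # Sales_(t-2)
df = df.dropna()                    # the first two rows have no lags

fit = smf.ols("sales ~ lag1 + lag2", data=df).fit()
print(fit.params)
```

The estimated `lag1` and `lag2` coefficients should come out close to the simulated values of 0.5 and 0.3.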


