# ASSIGNMENT - A3 (Part B)
# **Performing Probit Regression on NSSO68 dataset to identify non-vegetarians**

* **AUTHOR**     : Rakshitha Vignesh Sargurunathan               

* **VID**        : V01109007

* **CREATED ON** : 07/01/2024

# **Probit regression**

This analysis examines the factors that influence dietary choices, specifically identifying non-vegetarians with a probit regression model. Probit regression is used for binary dependent variables: it models the probability of the outcome as the cumulative distribution function (CDF) of the standard normal distribution applied to a linear combination of the predictors. The aim is to understand which predictors are associated with non-vegetarianism in the surveyed population.


**Link Function:** Probit regression models the outcome probability as P(y = 1 | x) = Φ(x'β), where Φ is the CDF of the standard normal distribution and x'β is the linear predictor. The probit link function is the inverse, Φ⁻¹, which maps a probability in (0, 1) onto the unbounded scale of the linear predictor.
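The relationship between the link and its inverse can be checked numerically. The following is a minimal sketch that uses the standard library's `NormalDist` as a stand-in for the normal CDF; the function names `probit_link` and `probit_inverse_link` are illustrative and not part of the notebook.

```python
from statistics import NormalDist

# Standard normal distribution (mean 0, standard deviation 1)
std_normal = NormalDist(mu=0.0, sigma=1.0)

def probit_link(p):
    """Map a probability in (0, 1) to the linear-predictor (z) scale."""
    return std_normal.inv_cdf(p)

def probit_inverse_link(z):
    """Map a linear predictor z back to a probability via the normal CDF."""
    return std_normal.cdf(z)

# Round trip: z -> probability -> z
for z in (-2.0, 0.0, 1.0):
    p = probit_inverse_link(z)
    print(f"z = {z:+.1f} -> P(y = 1) = {p:.4f} -> z = {probit_link(p):+.4f}")
```

A linear predictor of 0 corresponds to a probability of 0.5, and large negative or positive values push the probability towards 0 or 1.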

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
import warnings
# Suppress all Python warnings for cleaner output (note: this also hides convergence warnings)
warnings.filterwarnings('ignore')

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
data=pd.read_csv('/content/drive/MyDrive/SCMA/A1a/NSSO68.csv')

In [None]:
data.head()

Unnamed: 0,slno,grp,Round_Centre,FSU_number,Round,Schedule_Number,Sample,Sector,state,State_Region,...,pickle_v,sauce_jam_v,Othrprocessed_v,Beveragestotal_v,foodtotal_v,foodtotal_q,state_1,Region,fruits_df_tt_v,fv_tot
0,1,4.099999999999999e+31,1,41000,68,10,1,2,24,242,...,0.0,0.0,0.0,0.0,1141.4924,30.942394,GUJ,2,12.0,154.18
1,2,4.099999999999999e+31,1,41000,68,10,1,2,24,242,...,0.0,0.0,0.0,17.5,1244.5535,29.286153,GUJ,2,333.0,484.95
2,3,4.099999999999999e+31,1,41000,68,10,1,2,24,242,...,0.0,0.0,0.0,0.0,1050.3154,31.527046,GUJ,2,35.0,214.84
3,4,4.099999999999999e+31,1,41000,68,10,1,2,24,242,...,0.0,0.0,0.0,33.333333,1142.591667,27.834607,GUJ,2,168.333333,302.3
4,5,4.099999999999999e+31,1,41000,68,10,1,2,24,242,...,0.0,0.0,0.0,75.0,945.2495,27.600713,GUJ,2,15.0,148.0


In [None]:
data.columns

Index(['slno', 'grp', 'Round_Centre', 'FSU_number', 'Round', 'Schedule_Number',
       'Sample', 'Sector', 'state', 'State_Region',
       ...
       'pickle_v', 'sauce_jam_v', 'Othrprocessed_v', 'Beveragestotal_v',
       'foodtotal_v', 'foodtotal_q', 'state_1', 'Region', 'fruits_df_tt_v',
       'fv_tot'],
      dtype='object', length=384)

In [None]:
# Define the target variable (1 if non-vegetarian, 0 if vegetarian)
data['is_non_vegetarian'] = (data['nonvegtotal_q'] > 0).astype(int)
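A binary model can only be estimated when both outcomes occur, so it may be worth checking the class counts right after building the target. The sketch below runs on a small invented frame (the `toy` data is hypothetical, not taken from NSSO68) and also shows that a missing quantity compares as `False` and is therefore silently coded as 0.

```python
import pandas as pd

# Toy stand-in for the survey data; the quantities are invented for illustration
toy = pd.DataFrame({"nonvegtotal_q": [0.0, 1.5, 0.0, 2.0, 0.25, None]})

# Same rule as in the notebook: any positive quantity means non-vegetarian
toy["is_non_vegetarian"] = (toy["nonvegtotal_q"] > 0).astype(int)

# Count how many rows fall in each class
class_counts = toy["is_non_vegetarian"].value_counts().sort_index()
print(class_counts)

# A probit fit needs both classes to be present
has_both_classes = toy["is_non_vegetarian"].nunique() == 2
print("both classes present:", has_both_classes)

# Missing quantities become 0 under the rule above, so count them separately
n_missing = int(toy["nonvegtotal_q"].isna().sum())
print("rows with missing quantity:", n_missing)
```

The same three checks can be applied to `data` in place of `toy`.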

In [None]:
# Select features
features = ['hhdsz', 'Religion', 'Social_Group', 'Type_of_land_owned', 'Land_Owned',
            'MPCE_URP', 'Age', 'Sex', 'Education', 'Regular_salary_earner']

In [None]:
# Prepare the feature matrix (X) and target vector (y)
X = data[features]
y = data['is_non_vegetarian']

In [None]:
print("X sample: \n\n",X.head())
print("\n\nY sample: \n\n",y.head())

X sample: 

    hhdsz  Religion  Social_Group  Type_of_land_owned  Land_Owned  MPCE_URP  \
0      5       1.0           3.0                 1.0         1.0   3304.80   
1      2       3.0           9.0                 1.0         1.0   7613.00   
2      5       1.0           9.0                 1.0         2.0   3461.40   
3      3       3.0           9.0                 1.0         3.0   3339.00   
4      4       1.0           9.0                 1.0         2.0   2604.25   

   Age  Sex  Education  Regular_salary_earner  
0   50    1        8.0                    1.0  
1   40    2       12.0                    1.0  
2   45    1        7.0                    1.0  
3   75    1        6.0                    1.0  
4   30    1        7.0                    2.0  


Y sample: 

 0    0
1    0
2    0
3    0
4    0
Name: is_non_vegetarian, dtype: int64


In [None]:
# Handle missing values:
# drop rows with missing predictors, keep y aligned with the remaining rows,
# then add a constant column to the feature matrix (intercept term)
X = X.dropna()
y = y.loc[X.index]

X = sm.add_constant(X)

In [None]:
X

Unnamed: 0,const,hhdsz,Religion,Social_Group,Type_of_land_owned,Land_Owned,MPCE_URP,Age,Sex,Education,Regular_salary_earner
0,1.0,5,1.0,3.0,1.0,1.0,3304.80,50,1,8.0,1.0
1,1.0,2,3.0,9.0,1.0,1.0,7613.00,40,2,12.0,1.0
2,1.0,5,1.0,9.0,1.0,2.0,3461.40,45,1,7.0,1.0
3,1.0,3,3.0,9.0,1.0,3.0,3339.00,75,1,6.0,1.0
4,1.0,4,1.0,9.0,1.0,2.0,2604.25,30,1,7.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...
101657,1.0,6,1.0,9.0,2.0,303.0,817.00,39,1,7.0,2.0
101658,1.0,5,1.0,9.0,2.0,607.0,773.20,38,1,6.0,1.0
101659,1.0,7,1.0,9.0,2.0,404.0,663.29,42,1,5.0,2.0
101660,1.0,5,1.0,9.0,2.0,404.0,847.20,40,1,8.0,2.0


In [None]:
from statsmodels.discrete.discrete_model import Probit
# Fit the Probit regression model
probit_model = Probit(y, X).fit()

         Current function value: 0.000000
         Iterations: 35


In [None]:
# Print the summary of the model
print(probit_model.summary())

                          Probit Regression Results                           
Dep. Variable:      is_non_vegetarian   No. Observations:                87155
Model:                         Probit   Df Residuals:                    87144
Method:                           MLE   Df Model:                           10
Date:                Mon, 01 Jul 2024   Pseudo R-squ.:                     inf
Time:                        10:09:30   Log-Likelihood:            -1.8842e-07
converged:                      False   LL-Null:                        0.0000
Covariance Type:            nonrobust   LLR p-value:                     1.000
                            coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                    -8.3584   3856.734     -0.002      0.998   -7567.418    7550.701
hhdsz                    -0.0112    149.733  -7.48e-05      1.000    -293.483     293.460
Religion

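# **Interpretation of Coefficients**

The coefficients (coef) give the change in the latent-variable index, in z-score units, for a one-unit change in each predictor. Note that the summary reports converged: False, a log-likelihood and LL-Null of essentially zero, and an infinite pseudo R-squared. This pattern typically arises when the outcome takes a single value in the estimation sample, so the estimates and standard errors are not reliable and the large p-values should not be read as evidence of no effect. With that caveat, the reported coefficients are: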

* **const: -8.3584 (Intercept)**

This value is not significant (P>|z| = 0.998), indicating that the baseline probability of being non-vegetarian when all predictors are zero is not reliable.

* **hhdsz: -0.0112 (Household size)**

The coefficient is not significant (P>|z| = 1.000), suggesting no meaningful impact of household size on the probability of being non-vegetarian.

* **Religion: -0.1433**

This value is not significant (P>|z| = 1.000), indicating religion does not significantly affect the likelihood of being non-vegetarian.

* **Social_Group: 0.0314**

Not significant (P>|z| = 1.000), indicating no significant effect of social group.

* **Type_of_land_owned: 0.0153**

Not significant (P>|z| = 1.000), suggesting no effect of land ownership type.

* **Land_Owned: 4.604e-06**

Not significant (P>|z| = 1.000), indicating no significant effect of land owned.

* **MPCE_URP: -5.608e-05**

Not significant (P>|z| = 1.000), suggesting no effect of monthly per capita expenditure (MPCE).

* **Age: 0.0128**

Not significant (P>|z| = 1.000), indicating age does not significantly affect the likelihood of being non-vegetarian.

* **Sex: 0.3314**

Not significant (P>|z| = 1.000), indicating no significant effect of sex.

* **Education: 0.0140**

Not significant (P>|z| = 1.000), indicating education level does not significantly affect the probability of being non-vegetarian.

* **Regular_salary_earner: 0.1640**

Not significant (P>|z| = 1.000), suggesting no effect of being a regular salary earner.
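Several of the predictors above, such as Religion, Social_Group and Sex, appear to be category codes rather than quantities (an assumption based on the integer-like values in the X sample). A single coefficient per code imposes an ordering that may not be meaningful, so one option is to expand them into indicator columns. The sketch below does this on a small invented frame; the `toy` values are hypothetical.

```python
import pandas as pd

# Toy rows in the style of the X sample; values are for illustration only
toy = pd.DataFrame({
    "Religion": [1.0, 3.0, 1.0, 2.0],
    "Sex": [1, 2, 1, 1],
    "Age": [50, 40, 45, 75],
})

# One indicator column per category; drop_first avoids perfect collinearity
# with the intercept that sm.add_constant would add later
encoded = pd.get_dummies(
    toy,
    columns=["Religion", "Sex"],
    drop_first=True,
    dtype=float,
)
print(encoded.columns.tolist())
print(encoded)
```

Each remaining indicator is then interpreted relative to the dropped reference category.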

# **Characteristics and Advantages of the Probit Model**


* Normal Distribution Assumption: The probit model assumes that the error term of the underlying latent variable follows a standard normal distribution, whereas the logit model assumes a logistic distribution. The two usually give similar fitted probabilities and differ mainly in the tails.

* Interpretability: As in logistic regression, the sign and significance of a probit coefficient can be read directly, but its magnitude means something different: it is the change in the z-score of the latent variable for a one-unit change in the predictor, not a change in log-odds.

* Handling of Binary Outcomes: Like logistic regression, the probit model is specifically designed for binary outcome variables, making it suitable for classification problems.

* Flexibility: Probit models can be extended to handle ordinal and multinomial outcomes, providing flexibility for various types of dependent variables.
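Because a probit coefficient is a change on the z-score scale, its effect on the probability depends on where the observation sits. For a continuous predictor the marginal effect is the standard normal density at the linear predictor multiplied by the coefficient. The sketch below works this through with a hypothetical coefficient of 0.30 (not a value estimated in this notebook); `probit_marginal_effect` is an illustrative helper.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

def probit_marginal_effect(linear_predictor, coefficient):
    """Change in P(y = 1) per unit change in a continuous predictor."""
    return std_normal.pdf(linear_predictor) * coefficient

beta_j = 0.30  # hypothetical coefficient, chosen only for illustration

# The same coefficient moves the probability most near z = 0 (P = 0.5)
for z in (-2.0, 0.0, 2.0):
    p = std_normal.cdf(z)
    effect = probit_marginal_effect(z, beta_j)
    print(f"z = {z:+.1f}, P(y = 1) = {p:.3f}, marginal effect = {effect:.4f}")
```

For a fitted statsmodels result, `get_margeff()` reports average marginal effects computed on the same principle.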


# **Conclusion**

The probit regression model offers an alternative to logistic regression, particularly when a normally distributed latent error is a reasonable assumption for the data. When the model converges, it can identify significant predictors of being non-vegetarian and give insight into the factors influencing dietary choices. In the fit reported above, however, the model did not converge and no predictor is statistically significant, so no such predictors can be identified from these results.