# Ginsberg Data Analysis

This notebook analyzes depression scores from the Ginsberg dataset using multiple linear regression, with Simplicity and Fatalism as the predictors.

In [1]:
# Load Packages

import pandas as pd
import statsmodels.api as model

In [2]:
# Load Data

ginsberg = pd.read_csv("ginsberg.csv")

# Verifying Data Integrity

In [3]:
ginsberg.columns

# The file has no header row, so the column names must be assigned manually
ginsberg.columns = ["SUBJECT", "SIMPLICITY", "FATALISM", "DEPRESSION", "ADJUSTED SIMPLICITY", "ADJUSTED FATALISM", "ADJUSTED DEPRESSION"]

ginsberg.columns

Index(['SUBJECT', 'SIMPLICITY', 'FATALISM', 'DEPRESSION',
       'ADJUSTED SIMPLICITY', 'ADJUSTED FATALISM', 'ADJUSTED DEPRESSION'],
      dtype='object')

In [4]:
#Checking first few rows
ginsberg.head()

# Because the file has no header row, pandas consumed the first data row as column names, so the data now starts at subject 2. The header=None argument is needed when loading the data.

|   | SUBJECT | SIMPLICITY | FATALISM | DEPRESSION | ADJUSTED SIMPLICITY | ADJUSTED FATALISM | ADJUSTED DEPRESSION |
|---|---|---|---|---|---|---|---|
| 0 | 2 | 0.91097 | 1.18439 | 0.72787 | 0.72717 | 0.99915 | 0.51688 |
| 1 | 3 | 0.53366 | -0.05837 | 0.53411 | 0.62176 | 0.03811 | 0.70699 |
| 2 | 4 | 0.74118 | 0.35589 | 0.56641 | 0.83522 | 0.42218 | 0.65639 |
| 3 | 5 | 0.53366 | 0.77014 | 0.50182 | 0.47697 | 0.81423 | 0.53518 |
| 4 | 6 | 0.62799 | 1.39152 | 0.56641 | 0.40664 | 1.23261 | 0.34042 |


In [5]:
ginsberg = pd.read_csv("ginsberg.csv", header = None)

In [6]:
ginsberg.columns = ["SUBJECT", "SIMPLICITY", "FATALISM", "DEPRESSION", "ADJUSTED SIMPLICITY", "ADJUSTED FATALISM", "ADJUSTED DEPRESSION"]

ginsberg.columns

Index(['SUBJECT', 'SIMPLICITY', 'FATALISM', 'DEPRESSION',
       'ADJUSTED SIMPLICITY', 'ADJUSTED FATALISM', 'ADJUSTED DEPRESSION'],
      dtype='object')
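
The reload and the renaming in the two cells above can also be done in a single call, since `pd.read_csv` accepts a `names` argument alongside `header=None`. The sketch below is not part of the original analysis. It reads from an in-memory stand-in for `ginsberg.csv` built from the first two rows displayed in this notebook, and it assumes the file is comma-separated with no header row.

```python
import io

import pandas as pd

COLUMN_NAMES = [
    "SUBJECT", "SIMPLICITY", "FATALISM", "DEPRESSION",
    "ADJUSTED SIMPLICITY", "ADJUSTED FATALISM", "ADJUSTED DEPRESSION",
]

# Stand-in for ginsberg.csv (assumption: comma-separated, no header row).
# The two lines are the first two subjects displayed by ginsberg.head().
sample_csv = io.StringIO(
    "1,0.92983,0.35589,0.5987,0.75934,0.10673,0.41865\n"
    "2,0.91097,1.18439,0.72787,0.72717,0.99915,0.51688\n"
)

# header=None keeps the first line as data; names= assigns the column labels.
sample = pd.read_csv(sample_csv, header=None, names=COLUMN_NAMES)
print(sample)
```

With the real file, the same call would take the path `"ginsberg.csv"` in place of `sample_csv`.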

In [7]:
ginsberg.head()

|   | SUBJECT | SIMPLICITY | FATALISM | DEPRESSION | ADJUSTED SIMPLICITY | ADJUSTED FATALISM | ADJUSTED DEPRESSION |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.92983 | 0.35589 | 0.5987 | 0.75934 | 0.10673 | 0.41865 |
| 1 | 2 | 0.91097 | 1.18439 | 0.72787 | 0.72717 | 0.99915 | 0.51688 |
| 2 | 3 | 0.53366 | -0.05837 | 0.53411 | 0.62176 | 0.03811 | 0.70699 |
| 3 | 4 | 0.74118 | 0.35589 | 0.56641 | 0.83522 | 0.42218 | 0.65639 |
| 4 | 5 | 0.53366 | 0.77014 | 0.50182 | 0.47697 | 0.81423 | 0.53518 |


In [8]:
#Checking the number of unique Subjects

ginsberg.shape
ginsberg['SUBJECT'].nunique()

82

The number of unique Subject values (82) matches the number of rows in the data, so every subject appears exactly once.
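
Comparing `nunique()` against the row count by eye works here. A more direct check is `Series.is_unique`, with `duplicated()` to list any offending rows. The following is a minimal sketch on a small hypothetical frame; the `demo` values are made up for illustration and are not taken from the Ginsberg data.

```python
import pandas as pd

# Hypothetical IDs with one deliberate repeat, for illustration only.
demo = pd.DataFrame({"SUBJECT": [1, 2, 3, 3]})

# True only when no value occurs more than once.
has_unique_ids = demo["SUBJECT"].is_unique

# keep=False marks every occurrence of a repeated value, not just the later ones.
repeated = demo[demo["SUBJECT"].duplicated(keep=False)]

print(has_unique_ids)
print(repeated)
```

On the notebook's own frame, `ginsberg["SUBJECT"].is_unique` would be expected to be true given the counts above.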

In [9]:
ginsberg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82 entries, 0 to 81
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   SUBJECT              82 non-null     int64  
 1   SIMPLICITY           82 non-null     float64
 2   FATALISM             82 non-null     float64
 3   DEPRESSION           82 non-null     float64
 4   ADJUSTED SIMPLICITY  82 non-null     float64
 5   ADJUSTED FATALISM    82 non-null     float64
 6   ADJUSTED DEPRESSION  82 non-null     float64
dtypes: float64(6), int64(1)
memory usage: 4.6 KB


All measurement columns are float values, as expected (SUBJECT is an integer identifier), and no column has missing values. Thus, a regression can be run.
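
The same conclusion can be checked in code rather than read off the `info()` printout. This is a sketch under assumptions: `demo` is a two-row stand-in built from values shown earlier in this notebook, and `model_columns` is a name introduced here for the three columns used in the regression.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Two-row stand-in built from rows displayed by ginsberg.head().
demo = pd.DataFrame({
    "SUBJECT": [1, 2],
    "SIMPLICITY": [0.92983, 0.91097],
    "FATALISM": [0.35589, 1.18439],
    "DEPRESSION": [0.5987, 0.72787],
})

# Columns that enter the regression (hypothetical helper name).
model_columns = ["SIMPLICITY", "FATALISM", "DEPRESSION"]

# Every regression column must be numeric and free of missing values.
all_numeric = all(is_numeric_dtype(demo[col]) for col in model_columns)
missing_counts = demo[model_columns].isna().sum()
no_missing = int(missing_counts.sum()) == 0

print(all_numeric, no_missing)
```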

# Create the Regression

Declaring the dependent and independent variables:

In [10]:
y = ginsberg['DEPRESSION']
x1 = ginsberg[['SIMPLICITY','FATALISM']]

## Regression

In [11]:
x = model.add_constant(x1)
results = model.OLS(y,x).fit()
results.summary()

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | DEPRESSION | R-squared: | 0.519 |
| Model: | OLS | Adj. R-squared: | 0.507 |
| Method: | Least Squares | F-statistic: | 42.58 |
| Date: | Wed, 20 Jul 2022 | Prob (F-statistic): | 2.84e-13 |
| Time: | 18:13:16 | Log-Likelihood: | -29.024 |
| No. Observations: | 82 | AIC: | 64.05 |
| Df Residuals: | 79 | BIC: | 71.27 |
| Df Model: | 2 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.2027 | 0.095 | 2.140 | 0.035 | 0.014 | 0.391 |
| SIMPLICITY | 0.3795 | 0.101 | 3.771 | 0.000 | 0.179 | 0.580 |
| FATALISM | 0.4178 | 0.101 | 4.151 | 0.000 | 0.217 | 0.618 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 11.218 | Durbin-Watson: | 1.179 |
| Prob(Omnibus): | 0.004 | Jarque-Bera (JB): | 11.425 |
| Skew: | 0.857 | Prob(JB): | 0.0033 |
| Kurtosis: | 3.636 | Cond. No. | 6.0 |


# Results and Interpretation

Based on the regression results, both Simplicity and Fatalism are statistically significant predictors of Depression at the 5% significance level. Each has a p-value below .05 (both are reported as 0.000 after rounding), and neither 95% confidence interval contains 0.
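
The confidence intervals in the summary can be reproduced approximately from the reported coefficients and standard errors as coef ± t × std err. The sketch below assumes a two-sided 95% critical value of about 1.9905 for 79 residual degrees of freedom, hard-coded so that no extra library is needed. Because the inputs are the rounded values from the table, the results only match the summary to about three decimals.

```python
# Approximate t critical value for a 95% two-sided interval with 79 residual
# degrees of freedom (assumption: hard-coded rather than computed).
T_CRIT = 1.9905

# (coefficient, standard error) pairs as reported in the regression summary.
estimates = {
    "const": (0.2027, 0.095),
    "SIMPLICITY": (0.3795, 0.101),
    "FATALISM": (0.4178, 0.101),
}


def confidence_interval(coef, std_err, t_crit=T_CRIT):
    """Return the (lower, upper) bounds of coef +/- t_crit * std_err."""
    margin = t_crit * std_err
    return coef - margin, coef + margin


intervals = {
    name: confidence_interval(coef, se) for name, (coef, se) in estimates.items()
}

for name, (low, high) in intervals.items():
    # An interval that excludes 0 corresponds to p < .05 for that coefficient.
    excludes_zero = low > 0 or high < 0
    print(f"{name}: [{low:.3f}, {high:.3f}] excludes 0: {excludes_zero}")
```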

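The model's R^2 is .519, meaning that Simplicity and Fatalism together explain about 52% of the variance in Depression (adjusted R^2 = .507). Since the F-statistic probability (2.84e-13) is far below .05, the hypothesis that both slope coefficients are zero is rejected, so the model as a whole is statistically significant.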
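
As a consistency check, the adjusted R^2 and the F-statistic can be recomputed from R^2, the number of observations, and the number of predictors using the standard OLS formulas. The sketch below starts from the rounded R^2 of .519, so the F value it produces differs slightly from the reported 42.58. The variable names are introduced here for illustration.

```python
# Values reported in the regression summary.
r_squared = 0.519
n_obs = 82
n_predictors = 2

# Residual degrees of freedom: observations minus predictors minus intercept.
df_resid = n_obs - n_predictors - 1

# Adjusted R^2 penalizes R^2 for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# F compares explained variance per predictor with residual variance per df.
f_statistic = (r_squared / n_predictors) / ((1 - r_squared) / df_resid)

print(f"df_resid = {df_resid}")
print(f"adjusted R^2 = {adj_r_squared:.3f}")
print(f"F = {f_statistic:.2f}")
```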

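Overall, the fitted model is Depression = .2027 + (.3795 * Simplicity) + (.4178 * Fatalism). Holding the other predictor fixed, a one-unit increase in Simplicity is associated with a .3795 increase in predicted Depression, and a one-unit increase in Fatalism with a .4178 increase.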
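
To illustrate how the fitted equation is used, the sketch below turns it into a small prediction function and applies it to the first subject shown in the data (Simplicity 0.92983, Fatalism 0.35589, observed Depression 0.5987). The function name `predict_depression` is hypothetical, and the coefficients are the rounded values from the summary.

```python
# Rounded coefficients from the regression summary.
INTERCEPT = 0.2027
COEF_SIMPLICITY = 0.3795
COEF_FATALISM = 0.4178


def predict_depression(simplicity, fatalism):
    """Predicted Depression from the fitted linear equation."""
    return INTERCEPT + COEF_SIMPLICITY * simplicity + COEF_FATALISM * fatalism


# First subject displayed by ginsberg.head() after the corrected load.
observed = 0.5987
predicted = predict_depression(0.92983, 0.35589)

# Residual = observed minus predicted; negative means the model over-predicts.
residual = observed - predicted

print(f"predicted = {predicted:.4f}, residual = {residual:.4f}")
```

In the notebook itself, `results.predict(x)` would give the same kind of predictions from the unrounded coefficients.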