# Linear Regression with Coefficient Analysis

**Code Overview:**

This Python script fits an ordinary least squares (OLS) linear regression and reports per-coefficient statistics. It takes a CSV dataset and a user-defined regression equation as input, then calculates the regression coefficients, their p-values, R-squared, and adjusted R-squared, which together describe how strongly the predictors are related to the response.

- **Data Prep**: Prepare your CSV dataset with a header row.
- **Model Definition**: Input the regression equation using Patsy syntax.
- **Results**: Get coefficient details, p-values, R-squared, and adjusted R-squared.

The script is suitable for basic analysis; for more advanced statistics, consider established libraries such as scikit-learn or statsmodels. Make sure your dataset is well-formatted and that you understand the regression equation you supply. Learn more about Patsy syntax in the [documentation](https://patsy.readthedocs.io/en/stable/formulas.html).
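For reference, the quantities the script reports can be written out as follows. This is a summary of the standard OLS formulas rather than anything specific to this notebook; the notation is an assumption made here for illustration: $n$ is the number of observations, $p$ the number of columns in the design matrix $X$ (including the intercept), and $F_{t,\,n-p}$ the CDF of the Student t distribution with $n-p$ degrees of freedom.

$$
\begin{aligned}
\hat{\beta} &= (X^{\top}X)^{-1}X^{\top}y, \qquad e = y - X\hat{\beta}, \\
\mathrm{SSE} &= e^{\top}e, \qquad \mathrm{SST} = y^{\top}y - n\,\bar{y}^{2}, \\
R^{2} &= 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}, \qquad
\bar{R}^{2} = 1 - \frac{n-1}{n-p}\,\bigl(1 - R^{2}\bigr), \\
\hat{\sigma}^{2} &= \frac{\mathrm{SSE}}{n-p}, \qquad
\mathrm{SE}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^{2}\,\bigl[(X^{\top}X)^{-1}\bigr]_{jj}}, \\
t_j &= \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)}, \qquad
p_j = 2\,\bigl(1 - F_{t,\,n-p}(|t_j|)\bigr).
\end{aligned}
$$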


In [1]:
# importing necessary libraries for operations

import numpy as np
import pandas as pd
import sklearn.linear_model as lm
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.preprocessing import scale
import patsy
import sklearn as skl
import scipy

## Credit Dataset (ISL)

The Credit dataset accompanies the book *An Introduction to Statistical Learning* (ISL) and is commonly used to illustrate regression concepts. Each row describes one credit card holder.

**Key Characteristics of the Credit Dataset:**

- **Attributes**: The predictors include quantitative variables (`Income`, `Limit`, `Rating`, `Cards`, `Age`, `Education`) and categorical variables (`Own`, `Student`, `Married`, `Region`).

- **Response Variable**: `Balance`, the credit card balance, is a continuous variable, so the dataset suits linear regression rather than binary classification.

**Access the Credit Dataset:**
- You can download the [Credit dataset](https://www.statlearning.com/resources-python) from the ISL website.

The Credit dataset offers hands-on practice for anyone learning data analysis and regression modeling on credit-related data.


In [2]:
credit=pd.read_csv('Credit.csv')

In [3]:
credit

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Own,Student,Married,Region,Balance
0,14.891,3606,283,2,34,11,No,No,Yes,South,333
1,106.025,6645,483,3,82,15,Yes,Yes,Yes,West,903
2,104.593,7075,514,4,71,11,No,No,No,West,580
3,148.924,9504,681,3,36,11,Yes,No,No,West,964
4,55.882,4897,357,2,68,16,No,No,Yes,South,331
...,...,...,...,...,...,...,...,...,...,...,...
395,12.096,4100,307,3,32,13,No,No,Yes,South,560
396,13.364,3838,296,5,65,17,No,No,No,East,480
397,57.872,4171,321,5,67,12,Yes,No,Yes,South,138
398,37.728,2525,192,1,44,13,No,No,Yes,South,0


In [4]:
import numpy as np
import pandas as pd
import patsy
from scipy import stats

# Define a function for linear regression
def reg(df, eq):
    # Create the design matrices for the regression equation
    y, x = patsy.dmatrices(eq, df)
    
    # Calculate the transpose of x
    xT = np.transpose(x)
    
    # Calculate the regression coefficients using the normal equation
    B = np.linalg.inv(xT @ x) @ xT @ y
    
    # Calculate the residuals (observed minus fitted values)
    e = y - x @ B
    
    # Calculate the sum of squared errors (SSE)
    E = np.transpose(e) @ e
    
    # Calculate the total sum of squares (SST)
    SST = (np.transpose(y) @ y) - (len(df) * ((np.mean(y)) ** 2))
    
    # Calculate the coefficient of determination (R-squared)
    r2 = 1 - (E / SST)
    
    # Calculate the adjusted R-squared
    adj_r2 = 1 - ((len(df) - 1) / (len(df) - len(B))) * (1 - r2)
    
    # Estimate the residual variance: SSE divided by the residual degrees of freedom
    ssq = E / (len(df) - len(B))
    ssq = np.array(ssq)[0][0]
    
    # Calculate the standard errors of coefficients (SE) from the
    # diagonal of the estimated coefficient covariance matrix
    cov = ssq * np.linalg.inv(xT @ x)
    SE = np.sqrt(np.diag(cov)).reshape(-1, 1)
    
    # Calculate the absolute t-statistics (T) for coefficients
    T = np.absolute(B / SE)
    
    # Calculate the two-sided p-values for coefficients, rounded to 2 decimals
    # (the survival function sf is 1 - cdf, but more accurate in the tails)
    P = np.round(2 * stats.t.sf(T, len(df) - len(B)), decimals=2)
    
    # Combine coefficients and p-values into a single array
    Res = np.concatenate((B, P), axis=1)
    
    # Create a DataFrame to store coefficient and p-value information
    res = pd.DataFrame(data=Res)
    res.columns = ["Coefficient", "P_Values"]
    
    return res, r2, adj_r2

# Input filename from the user
filename = input("Enter your file name: ")

# Read data from the CSV file into a DataFrame
df = pd.read_csv(filename)

# Input the regression model equation
eq = input('Enter the model: ')

# Perform linear regression and store the results
ans = reg(df, eq)

# Display the coefficients and p-values
print(ans[0])

# Display R-squared value
print('R Square:', np.round(ans[1][0][0], 3))

# Display adjusted R-squared value
print('Adjusted R Square:', np.round(ans[2][0][0], 3))

Enter your file name: Credit.csv
Enter the model: Balance~Rating
   Coefficient  P_Values
0  -390.846342       0.0
1     2.566240       0.0
R Square: 0.746
Adjusted R Square: 0.745
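
The explicit inverse in the normal equation works for a small, well-conditioned problem like this one, but a least-squares solver is generally the more numerically stable choice. As a minimal sketch on synthetic data (the arrays and the true coefficients 2.0 and 3.0 are made up for illustration and are not taken from the Credit dataset), the following compares the normal-equation estimate with `np.linalg.lstsq` and recomputes R-squared and adjusted R-squared from the residuals:

```python
import numpy as np

# Synthetic data for illustration only (not the Credit dataset)
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 + rng.normal(scale=0.5, size=n)

# Design matrix with an intercept column, similar to what Patsy builds for "y ~ x1"
X = np.column_stack([np.ones(n), x1])

# Normal equation, as in the script above
beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Least-squares solver; avoids forming the explicit inverse
beta_lstsq, _, _, _ = np.linalg.lstsq(X, y, rcond=None)

# R-squared and adjusted R-squared from the residuals of the least-squares fit
resid = y - X @ beta_lstsq
sse = resid @ resid
sst = np.sum((y - y.mean()) ** 2)
r2 = 1 - sse / sst
adj_r2 = 1 - (n - 1) / (n - X.shape[1]) * (1 - r2)

print("normal equation:", beta_normal)
print("lstsq:          ", beta_lstsq)
print("R Square:", round(r2, 3), "Adjusted R Square:", round(adj_r2, 3))
```

Both estimates should agree to floating-point precision here; the difference only starts to matter when the predictors are strongly collinear.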


In [5]:
# verifying the results using statsmodels' built-in OLS implementation

fit = smf.ols('Balance~Rating', credit)
fit = fit.fit()
print(fit.summary().tables[0])
print(fit.summary().tables[1])

                            OLS Regression Results                            
Dep. Variable:                Balance   R-squared:                       0.746
Model:                            OLS   Adj. R-squared:                  0.745
Method:                 Least Squares   F-statistic:                     1168.
Date:                Sun, 10 Sep 2023   Prob (F-statistic):          1.90e-120
Time:                        21:51:24   Log-Likelihood:                -2745.4
No. Observations:                 400   AIC:                             5495.
Df Residuals:                     398   BIC:                             5503.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   -390.8463     29.069    -13.446      0.0