### Lab 01: Pandas, Numpy, and Statsmodels - Beginner's Guide

This notebook is a beginner's introduction to data science in Python. It introduces three essential libraries: Pandas, Numpy, and Statsmodels.

- **Pandas**: A library for data manipulation and analysis, offering powerful data structures like DataFrames.
- **Numpy**: A fundamental package for numerical computation in Python, supporting multi-dimensional arrays and mathematical functions.
- **Statsmodels**: A library for statistical modeling, enabling users to perform tasks like linear regression and hypothesis testing.

By the end of this lab, you will be able to:

1. Load and explore data using Pandas.
2. Perform basic data manipulations such as selecting, filtering, and creating new columns.
3. Use Numpy for numerical operations and understand the concept of vectors and dot products.
4. Apply Statsmodels to perform basic statistical analyses, including regression modeling.



## Pandas: Working with Tabular Data

Pandas is a Python library for data manipulation and analysis. It is especially useful for handling structured data (e.g., tables or spreadsheets).

### Key Topics
1. Loading data
2. Exploring data
3. Selecting and filtering data
4. Creating new columns


In [None]:
# Install the packages used in this lab
import sys
!{sys.executable} -m pip install "numpy<2.0"
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install statsmodels

In [2]:
# Step 1: Import Libraries

import pandas as pd
import numpy as np
from statsmodels.regression.linear_model import OLS
from IPython.display import display


### Step 2: Loading Data
To load tabular data from a CSV file into a DataFrame, we use the `pd.read_csv` function.

In [None]:
# Load the dataset
file_path = 'possum.csv'
df = pd.read_csv(file_path)

# Display the first five rows
display(df.head())
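
If `possum.csv` is not at hand, the same loading step can be tried on an in-memory file. The sketch below is only an illustration: the CSV text and the name `sample_df` are made up here and are not the lab's dataset. It relies on `read_csv` accepting a file-like object as well as a path.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk (values are made up).
csv_text = """site,sex,age
1,m,8.0
1,f,6.0
2,f,
"""

# read_csv accepts a path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)         # (3, 3): three rows, three columns
print(sample_df.dtypes)        # inferred type of each column
print(sample_df.isna().sum())  # number of missing values per column
```

The empty field in the last row is read as a missing value (`NaN`), which is why checking `isna()` right after loading is a useful habit.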

### Step 3: Exploring Data
Once the data is loaded, we can explore its structure and contents.


In [None]:
# Get information about the dataset
df.info()

# Display basic statistics
display(df.describe())

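### Step 4: Selecting Data
A single column can be selected with bracket notation, dot notation, or `.loc`. Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.

In [None]: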
column_1_bracket = df['age']
column_1_dot = df.age
column_1_loc = df.loc[:, 'age']

# Display the column values. They should be the same
display(column_1_bracket)
display(column_1_dot)
display(column_1_loc)
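
The lab objectives also mention filtering rows. A minimal sketch of boolean filtering follows, using a small made-up DataFrame named `toy` (hypothetical data, not the possum dataset) so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror ones used in this lab.
toy = pd.DataFrame({"age": [2.0, 5.0, 7.0, 3.0], "sex": ["m", "f", "f", "m"]})

# A comparison on a column produces a boolean Series (a "mask").
mask = toy["age"] > 4

# Indexing with the mask keeps only the rows where it is True.
older = toy[mask]

# Combine conditions with & (and) or | (or); wrap each condition in parentheses.
older_females = toy[(toy["age"] > 4) & (toy["sex"] == "f")]

# .loc can filter rows and select a column in one step.
ages_of_males = toy.loc[toy["sex"] == "m", "age"]

print(older)
print(older_females)
print(ages_of_males)
```

Note that filtering keeps the original row labels, so `older` has index values 1 and 2 rather than 0 and 1.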


### Step 5: Creating a New Column
Pandas allows us to create new columns by performing operations on existing ones.

In [None]:
# Convert age from years to days
df['age_days'] = df['age'] * 365.25

# Display the first five rows
display(df.head())
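
Assigning to `df['age_days']` modifies `df` in place. As a sketch of an alternative, `DataFrame.assign` returns a new DataFrame and leaves the original untouched. The data and the name `toy` below are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for the lab's DataFrame.
toy = pd.DataFrame({"age": [2.0, 5.0, 7.0]})

# assign returns a copy with the extra column; toy itself is unchanged.
with_days = toy.assign(age_days=toy["age"] * 365.25)

print(with_days)
print("age_days" in toy.columns)  # False
```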


### Step 6: Saving Data
You can save the modified DataFrame back to a CSV file.


In [7]:
# Save DataFrame to a new CSV file
df.to_csv('updated_possum.csv', index=False)
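
To see what `index=False` changes, the sketch below calls `to_csv` without a file path, in which case pandas returns the CSV text as a string. The tiny DataFrame `toy` is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"age": [2.0, 5.0]})

# With no path, to_csv returns the CSV content as a string.
with_index = toy.to_csv()
without_index = toy.to_csv(index=False)

print(with_index)     # first column holds the row labels 0, 1
print(without_index)  # only the data columns are written
```

Leaving the index out avoids an extra unnamed column appearing when the file is read back with `pd.read_csv`.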

## Numpy: Numerical Computing

Numpy is a library for numerical operations. It is the backbone for many other scientific libraries in Python.

### Key Topics
1. Arrays
2. Array operations
3. Linear algebra

### Step 1: Creating Arrays


In [None]:
# Create a 1D array
array_1d = np.array([1, 2, 3, 4, 5])

# Create a 2D array
array_2d = np.array([[1, 2, 3], [4, 5, 6]])

# Display arrays
print("1D Array:", array_1d)
print("2D Array:\n", array_2d)

### Step 2: Array Operations
Numpy arrays support element-wise operations.


In [None]:
# Basic arithmetic
array = np.array([10, 20, 30])

print("Addition:", array + 5)
print("Multiplication:", array * 2)
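
Element-wise operations also apply between two arrays of the same shape. The short sketch below illustrates this and contrasts it with plain Python lists; the names `a` and `b` are introduced here for illustration.

```python
import numpy as np

a = np.array([10, 20, 30])
b = np.array([1, 2, 3])

# Each operation pairs up elements at the same position.
sums = a + b        # [11 22 33]
products = a * b    # [10 40 90]
ratios = a / b      # [10. 10. 10.]

print(sums, products, ratios)

# A plain Python list behaves differently: * repeats the list.
repeated = [10, 20, 30] * 2
print(repeated)     # [10, 20, 30, 10, 20, 30]
```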


### Step 3: Reshaping Arrays
Transform arrays to desired shapes using `reshape`.

In [None]:
# Reshape a 1D array to 2D
array = np.array([1, 2, 3, 4])
reshaped_array = array.reshape(2, 2)

print("Original Array:", array)
print("Reshaped Array:\n", reshaped_array)
print()

# Transposing: compare a 1D array with a 2D array
x = np.array([1, 2, 3, 4])
print(x)
print(x.T)  # Transposing does not change a 1D array
print()

y = np.array([[5, 6, 7, 8]])
print(y)
print(y.T)  # Transposing turns a 2D row vector (1, 4) into a column vector (4, 1)



The `reshape` method is a convenient way to turn a 1-D array into a 2-D array. Calling `x.reshape(-1, 1)` produces an array of shape (n, 1). Passing -1 tells numpy to infer the size of that dimension from the total number of elements. For example, if `z` has 3 elements, `z.reshape(-1, 1)` returns a (3, 1) array: we did not have to specify the size of the first dimension because numpy inferred it.


In [None]:
print(x)

z = x.reshape(-1,1)

print(z)
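
The -1 placeholder can stand in for either dimension. The sketch below, which uses a made-up six-element array, prints the shapes so the inference is visible.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])

col = x.reshape(-1, 1)   # infer rows:    shape (6, 1)
row = x.reshape(1, -1)   # infer columns: shape (1, 6)
grid = x.reshape(2, -1)  # infer columns: shape (2, 3)

print(x.shape, col.shape, row.shape, grid.shape)

# The total number of elements must divide evenly;
# x.reshape(4, -1) would raise a ValueError because 6 is not a multiple of 4.
```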

### Step 4: Dot Product
Perform linear algebra operations like dot products.


In [None]:
# Define vectors
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

# Calculate dot product
dot_product = np.dot(x, y)
print("Dot Product:", dot_product)
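
The dot product is the sum of the element-wise products. As a sketch, the code below computes it three ways for the same vectors and then applies the `@` operator to a small matrix `A`, which is made up here for illustration.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

# 1*4 + 2*5 + 3*6 = 32
manual = sum(xi * yi for xi, yi in zip(x, y))
via_sum = np.sum(x * y)
via_at = x @ y  # the @ operator gives the same result for 1D arrays

print(manual, via_sum, via_at)  # 32 32 32

# With a 2D array, @ takes the dot product of each row with the vector.
A = np.array([[1, 0, 0],
              [0, 2, 0]])
print(A @ x)  # [1 4]
```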

### Step 5: Squeeze
`np.squeeze` removes dimensions of length 1 from an array, for example turning a (1, 3) array into a (3,) array.

In [None]:
# Using Squeeze on a 2D array to convert it to a 1D array

array = np.array([[1, 2, 3]])
print("Original Array:", array)

# Squeeze the array
squeezed_array = np.squeeze(array)

print("Squeezed Array:", squeezed_array)
# this removes the extra dimension
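
Printing shapes makes the effect of squeezing easier to see. The sketch below uses illustrative names (`row`, `col`, `grid`) that are not part of the lab.

```python
import numpy as np

row = np.array([[1, 2, 3]])  # shape (1, 3)
col = row.T                  # shape (3, 1)
grid = np.zeros((2, 3))      # no length-1 dimensions

print(np.squeeze(row).shape)          # (3,)
print(np.squeeze(col).shape)          # (3,)
print(np.squeeze(col, axis=1).shape)  # (3,): squeeze only the named axis
print(np.squeeze(grid).shape)         # (2, 3): nothing to remove
```

Passing `axis` is a way to be explicit about which dimension is expected to have length 1.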



## Statsmodels: Simple Linear Regression

Statsmodels is a library for statistical modeling. Here, we perform a simple linear regression.

### Step 1: Prepare Data

In [13]:
import statsmodels.api as sm
# Example dataset
x = np.array([1, 2, 3, 4, 5, 6, 8, 11, 12, 13])
y = np.array([2, 4, 5, 4, 5, 6, 7, 8, 9, 12])

# Add a constant term for the intercept
X = sm.add_constant(x)
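
To make the constant term concrete, here is a NumPy-only sketch that builds the same kind of design matrix by hand. It assumes the default behaviour of `add_constant`, which places the column of ones first; `X_manual` is a name introduced here for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 8, 11, 12, 13])

# Column 0 is all ones (the intercept), column 1 is the predictor.
X_manual = np.column_stack([np.ones(len(x)), x])

print(X_manual.shape)  # (10, 2)
print(X_manual[:3])
```

Each row of the matrix corresponds to one observation, so the fitted line is intercept times 1 plus slope times x.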

### Step 2: Fit Model

In [None]:
# Fit an Ordinary Least Squares (OLS) model
model = OLS(y, X).fit()

# Print summary
print(model.summary())

# Explanation of the Terms above

## General Model Information:

Dep. Variable: The dependent (response) variable (y) being predicted or modeled.

Model: Specifies the type of regression model used (in this case, OLS).

Method: Indicates the method used to fit the model, which is "Least Squares."

No. Observations: Number of data points (observations) used in the model (10 in this case).

Df Residuals: Degrees of freedom of residuals, calculated as the number of observations minus the number of model parameters estimated (10 - 2 = 8 here).

Df Model: Degrees of freedom of the model, representing the number of predictors (1 predictor here).

## Goodness-of-Fit Metrics:

R-squared: A measure of how well the independent variable(s) explain the variability in the dependent variable. A value of 0.910 indicates that 91% of the variance in y is explained by the model.

Adj. R-squared: Adjusted R-squared accounts for the number of predictors and penalizes adding variables that do not improve the model significantly. Here, it is slightly lower (0.898).

F-statistic: Tests the overall significance of the model against the null hypothesis that all slope coefficients are zero. Larger values are stronger evidence against that null hypothesis. Here the value is 80.46.

Prob (F-statistic): The p-value for the F-statistic. A small value (1.90e-05) is strong evidence that at least one predictor is related to y.
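
These goodness-of-fit numbers can be reproduced from the residuals. The following is a NumPy-only sketch using the standard textbook formulas, not a description of how Statsmodels computes them internally; the variable names are introduced here for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 8, 11, 12, 13])
y = np.array([2, 4, 5, 4, 5, 6, 7, 8, 9, 12])

# Least-squares fit with an intercept column.
X = np.column_stack([np.ones(len(x)), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta
ss_res = np.sum(residuals ** 2)       # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation
n, p = X.shape                        # p counts the intercept too

r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p)
f_stat = ((ss_tot - ss_res) / (p - 1)) / (ss_res / (n - p))

print(round(r2, 3), round(adj_r2, 3), round(f_stat, 2))  # 0.91 0.898 80.46
```

Here `n - p` is the Df Residuals value (10 - 2 = 8) and `p - 1` is the Df Model value (1).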

## Model Selection Metrics:

Log-Likelihood: The log of the likelihood function, used in comparison of models. Higher values indicate a better fit.

AIC (Akaike Information Criterion): A measure of model quality based on the likelihood and complexity. Lower values indicate a better fit.

BIC (Bayesian Information Criterion): Similar to AIC but penalizes model complexity more heavily for all but very small samples. As with AIC, lower values indicate a better model.
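
As a rough sketch of where these numbers come from, the code below evaluates the Gaussian log-likelihood at the least-squares fit and derives AIC and BIC from it. It assumes the variance is estimated as the residual sum of squares divided by n, and that the parameter count `k` is the number of regression coefficients including the intercept; both are assumptions about the convention in use, so compare against the printed summary before relying on them.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 8, 11, 12, 13])
y = np.array([2, 4, 5, 4, 5, 6, 7, 8, 9, 12])

X = np.column_stack([np.ones(len(x)), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
ss_res = np.sum((y - X @ beta) ** 2)

n, k = X.shape  # k = 2 coefficients (assumed parameter count)

# Gaussian log-likelihood with variance estimated as ss_res / n (assumption).
llf = -0.5 * n * (np.log(2 * np.pi) + np.log(ss_res / n) + 1)

aic = -2 * llf + 2 * k
bic = -2 * llf + k * np.log(n)

print("Log-likelihood:", llf)
print("AIC:", aic)
print("BIC:", bic)
```

With n = 10, each parameter costs 2 under AIC and about 2.3 under BIC, which is why BIC comes out slightly larger here.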

## Coefficients Table:

coef: The estimated coefficients for the predictors. The const term (2.0228) represents the intercept, and x1 (0.6426) is the slope for the independent variable.

std err: The standard error of each coefficient, an estimate of how much the coefficient would vary from sample to sample. Smaller values mean a more precise estimate.

t: The t-statistic (coef divided by std err) for testing whether the coefficient is significantly different from zero.

P>|t|: The p-value associated with the t-statistic. Small values (e.g., < 0.05) indicate statistical significance. Both const (0.006) and x1 (0.000, meaning less than 0.0005 after rounding) are significant.

[0.025, 0.975]: The lower and upper bounds of the 95% confidence interval for each coefficient.
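
The columns of the coefficients table are related to one another, which the NumPy-only sketch below tries to show using the standard OLS formulas. The critical value 2.306 is the 97.5th percentile of the t distribution with 8 degrees of freedom, hard-coded here to avoid an extra dependency; the variable names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 8, 11, 12, 13])
y = np.array([2, 4, 5, 4, 5, 6, 7, 8, 9, 12])

X = np.column_stack([np.ones(len(x)), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

n, p = X.shape
residuals = y - X @ beta
sigma2 = np.sum(residuals ** 2) / (n - p)  # residual variance estimate

# Standard errors: square roots of the diagonal of sigma2 * (X'X)^-1.
cov = sigma2 * np.linalg.inv(X.T @ X)
std_err = np.sqrt(np.diag(cov))

t_stats = beta / std_err  # the "t" column

t_crit = 2.306  # t distribution, 8 degrees of freedom, 97.5th percentile
lower = beta - t_crit * std_err
upper = beta + t_crit * std_err

for name, b, se, t, lo, hi in zip(["const", "x1"], beta, std_err, t_stats, lower, upper):
    print(f"{name}: coef={b:.4f} std_err={se:.4f} t={t:.3f} CI=[{lo:.3f}, {hi:.3f}]")
```

With a single predictor, the square of the slope's t-statistic equals the F-statistic, which is a quick consistency check on the summary.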