# Comprehensive Data Analysis Notebook

This notebook consolidates techniques that recur across data science projects into a single worked example. It walks through an end-to-end workflow: loading data, cleaning and reshaping it, visualizing it, and building and evaluating a machine learning model.

**Features Covered:**
1.  **Data Loading and Exploration**: Reading data, basic inspection.
2.  **Data Manipulation**: Cleaning, transforming, and aggregating data.
3.  **Data Visualization**: Using Matplotlib, Seaborn, and Plotly for insights.
4.  **Machine Learning and Statistics**: Performing statistical tests and building a regression model.

### Setup: Importing Necessary Libraries

First, we import all the Python libraries we'll need for our analysis. This includes `pandas` for data manipulation, `numpy` for numerical operations, `matplotlib`, `seaborn`, and `plotly` for visualizations, and `scikit-learn` and `scipy` for machine learning and statistics.

In [None]:
import pandas as pd
import numpy as np

# Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Machine Learning and Statistics
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from scipy import stats

# To handle string data as a file for reading CSV
from io import StringIO

## 1. Data Loading and Exploration

This section covers loading data into our environment. We'll demonstrate reading from a CSV file and then explore the basic properties of our dataset to get a first understanding of its structure and content.

### Reading from a CSV

Here, we simulate a CSV file using a string and then read it into a pandas DataFrame. This is a common first step when your data is stored in a `.csv` file.

In [None]:
csv_data = '''product_id,product_name,price,launch_date
101,Gadget A,199.99,2023-01-15
102,Widget B,49.50,2023-02-20
103,Thing C,89.00,
104,Device D,249.99,2023-04-10
104,Device D,249.99,2023-04-10
105,Gizmo E,120.00,2023-05-25'''

data_file = StringIO(csv_data)
df_products = pd.read_csv(data_file)

print("Product data loaded successfully!")
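When the date column is known in advance, it can also be parsed at read time. The sketch below is a minimal illustration using a hypothetical two-row CSV (the name `df_parsed` and the sample rows are assumptions, not part of the dataset above). It shows that `parse_dates` converts the column while reading and that an empty field becomes a missing value.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row CSV, used only for this illustration
sample_csv = """product_id,product_name,price,launch_date
101,Gadget A,199.99,2023-01-15
103,Thing C,89.00,
"""

# parse_dates converts the named column to datetime while reading;
# the empty launch_date field is read as a missing value (NaT)
df_parsed = pd.read_csv(StringIO(sample_csv), parse_dates=["launch_date"])

print(df_parsed.dtypes)
print(df_parsed["launch_date"].isna().sum())
```

With a real file, the `StringIO` object would be replaced by the file path.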

### Initial DataFrame Exploration

We use methods like `.head()`, `.shape`, and `.info()` to inspect the first few rows, check the dimensions (rows, columns), and get a summary of data types and non-null values.

In [None]:
print("First 5 rows of the product data:")
display(df_products.head())

print(f"\nDataset dimensions (rows, columns): {df_products.shape}")

print("\nData types and non-null values:")
df_products.info()

## 2. Data Manipulation

Data is rarely perfect. This section covers essential data manipulation techniques like cleaning (handling duplicates and missing values), type conversion, and aggregation to prepare the data for analysis and modeling.

### Data Cleaning: Duplicates and Missing Values

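We first check for and remove duplicate rows. Then we identify columns with missing data and fill them. Here the missing launch date is filled with the column's mode. Because every remaining date is unique after de-duplication, `mode()` returns all of them in sorted order, so `[0]` selects the earliest date rather than a genuinely most frequent one.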

In [None]:
# --- Handling Duplicates ---
print(f"Number of duplicate rows: {df_products.duplicated().sum()}")
df_products.drop_duplicates(inplace=True)
print(f"Number of duplicates after cleaning: {df_products.duplicated().sum()}")

# --- Handling Missing Data ---
print("\nMissing values per column:")
print(df_products.isna().sum())

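# Fill missing launch_date with the mode (most frequent value).
# Assign the result back to the column: fillna(inplace=True) on a selected
# column raises a FutureWarning in pandas 2.x and does not update
# df_products once Copy-on-Write is enabled.
mode_date = df_products['launch_date'].mode()[0]
df_products['launch_date'] = df_products['launch_date'].fillna(mode_date)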

print("\nMissing values after filling:")
print(df_products.isna().sum())
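To make the behavior of a mode-based fill concrete, here is a small sketch on a hypothetical Series of date strings with no repeats (the values are illustrative only). It suggests why the mode is a weak choice when nothing repeats: every distinct value counts as a mode.

```python
import pandas as pd

# Hypothetical launch dates with no repeated value and one missing entry
dates = pd.Series(["2023-02-20", "2023-01-15", None, "2023-04-10"])

# With no repeats, every distinct value is a mode; pandas returns them sorted
modes = dates.mode()
print(modes.tolist())

# mode()[0] is therefore just the smallest value, which fills the gap
filled = dates.fillna(modes[0])
print(filled.tolist())
```

Depending on the analysis, leaving the value missing or dropping the row may be a more defensible choice than inventing a date.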

### Data Type Conversion

Columns are often read with incorrect data types (e.g., dates as strings). Here, we convert the `launch_date` column from a generic object to a proper `datetime` type, which enables time-based analysis.

In [None]:
print("Data types before conversion:")
print(df_products.dtypes)

df_products['launch_date'] = pd.to_datetime(df_products['launch_date'])

print("\nData types after conversion:")
print(df_products.dtypes)
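Real-world date columns often contain values that cannot be parsed. The following sketch, on a hypothetical Series, shows one way to handle that with the `errors="coerce"` option of `pd.to_datetime`, which turns unparseable entries into `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical raw column: one valid date, one malformed entry, one missing value
raw = pd.Series(["2023-01-15", "not a date", None])

# errors="coerce" converts anything unparseable to NaT rather than raising
parsed = pd.to_datetime(raw, errors="coerce")

print(parsed)
print(f"Unparseable or missing entries: {parsed.isna().sum()}")
```

Counting the `NaT` values before and after conversion is a quick check on how many entries were silently coerced.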

### Data Aggregation

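To summarize data, we use `groupby()`. Here, we create a new `launch_month` column and then group by it to calculate the average price of products launched each month. Note that `groupby()` sorts its keys, so the month names come back in alphabetical order, not calendar order.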

In [None]:
df_products['launch_month'] = df_products['launch_date'].dt.month_name()
avg_price_by_month = df_products.groupby('launch_month')['price'].mean().reset_index()

print("Average product price by launch month:")
display(avg_price_by_month)
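If calendar order matters, one option is to group by the month number alongside the month name. The sketch below rebuilds a small hypothetical copy of the cleaned product table (the frame `df` and the column `month_num` are assumptions made for illustration) and uses named aggregation to compute both the mean price and the product count per month.

```python
import pandas as pd

# Hypothetical copy of the cleaned product data, for illustration only
df = pd.DataFrame({
    "price": [199.99, 49.50, 89.00, 249.99, 120.00],
    "launch_date": pd.to_datetime(
        ["2023-01-15", "2023-02-20", "2023-01-15", "2023-04-10", "2023-05-25"]
    ),
})
df["launch_month"] = df["launch_date"].dt.month_name()
df["month_num"] = df["launch_date"].dt.month

# Grouping by the month number first keeps the result in calendar order;
# named aggregation yields one output column per statistic
summary = (
    df.groupby(["month_num", "launch_month"])["price"]
    .agg(avg_price="mean", n_products="count")
    .reset_index()
)

print(summary)
```

Including the count next to the mean makes it visible when an average rests on a single product.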

### NumPy Array Manipulation

`NumPy` is the foundation of numerical computing in Python. Here, we create a 2D array, inspect its properties, and perform basic operations like slicing to select a subset of the data and applying a mathematical function.

In [None]:
# Create a 3x4 NumPy array of random floats in [0, 100)
my_array = np.random.rand(3, 4) * 100

print("Original NumPy Array:")
print(my_array)

print(f"\nShape: {my_array.shape}")
print(f"Data Type: {my_array.dtype}")

# Slicing: get the first 2 rows and last 2 columns
subset = my_array[:2, 2:]
print("\nSliced Subset:")
print(subset)

# Applying a function: find the square root of all elements
sqrt_array = np.sqrt(my_array)
print("\nArray after applying sqrt:")
print(sqrt_array.round(2))
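Two other operations that come up constantly are axis-wise aggregation and boolean masking. The sketch below is a minimal illustration under a few assumptions: it uses NumPy's `default_rng` generator with an arbitrarily chosen seed so the values are reproducible, and the names `arr`, `col_means`, `row_max`, and `above_50` are introduced here only for the example.

```python
import numpy as np

# Seeded generator so the example is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(seed=0)
arr = rng.random((3, 4)) * 100

# axis=0 aggregates down the rows, giving one value per column
col_means = arr.mean(axis=0)

# axis=1 aggregates across the columns, giving one value per row
row_max = arr.max(axis=1)

# A boolean mask selects matching elements and returns them as a 1-D array
above_50 = arr[arr > 50]

print("Column means:", col_means.round(2))
print("Row maxima:", row_max.round(2))
print("Values above 50:", above_50.round(2))
```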

## 3. Data Visualization

Visualization is key to uncovering patterns and communicating findings. This section demonstrates how to create plots using `matplotlib`, `seaborn`, and `plotly`.

### Visualization with Matplotlib

`Matplotlib` is a versatile library for creating static plots. Here, we create a bar chart to visualize the average product prices by launch month that we calculated earlier.

In [None]:
plt.figure(figsize=(10, 6))
plt.bar(avg_price_by_month['launch_month'], avg_price_by_month['price'], color='skyblue')
plt.title('Average Product Price by Launch Month', fontsize=16)
plt.xlabel('Month', fontsize=12)
plt.ylabel('Average Price ($)', fontsize=12)
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

### Visualization with Seaborn

`Seaborn` is built on top of Matplotlib and is excellent for statistical visualizations. We will load the California Housing dataset to demonstrate a regression plot, which shows the relationship between two variables along with a linear regression line.

In [None]:
# Load a sample dataset for regression analysis
housing = fetch_california_housing(as_frame=True)
df_housing = housing.frame

# Seaborn Regression Plot
plt.figure(figsize=(10, 6))
sns.regplot(data=df_housing, x='MedInc', y='MedHouseVal', 
            scatter_kws={'alpha':0.3}, line_kws={'color':'red'})
plt.title('Median Income vs. Median House Value in California', fontsize=16)
plt.xlabel('Median Income (in tens of thousands of $)', fontsize=12)
plt.ylabel('Median House Value (in hundreds of thousands of $)', fontsize=12)
plt.show()
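To get the numbers behind a fitted line like the one in the plot, an ordinary least-squares fit can be computed directly. The sketch below assumes that the default `regplot` line is a first-degree least-squares fit, and it uses synthetic data (not the housing dataset) with a made-up true slope of 0.4 and intercept of 0.5.

```python
import numpy as np

# Synthetic stand-in data; the true slope (0.4) and intercept (0.5) are made up
rng = np.random.default_rng(seed=0)
x = rng.uniform(0, 10, size=200)
y = 0.4 * x + 0.5 + rng.normal(0, 0.5, size=200)

# First-degree polynomial fit, i.e. ordinary least squares with one predictor
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson correlation between the two variables
r = np.corrcoef(x, y)[0, 1]

print(f"Slope: {slope:.3f}, Intercept: {intercept:.3f}, r: {r:.3f}")
```

For a single predictor, the slope equals the correlation multiplied by the ratio of the standard deviations of `y` and `x`, which links the steepness of the line to the strength of the relationship.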

### Visualization with Plotly

`Plotly` creates interactive visualizations that are great for web-based dashboards and exploration. Below is an interactive scatter plot where you can hover over points to see details.

In [None]:
# Using a sample of the data to keep the plot from being too crowded
df_sample = df_housing.sample(n=1000, random_state=42)

fig = px.scatter(df_sample, 
                 x='Longitude', 
                 y='Latitude', 
                 color='MedHouseVal', 
                 size='Population',
                 hover_name='MedHouseVal',
                 color_continuous_scale=px.colors.sequential.Viridis,
                 title='California Housing: Value by Geo-location')
fig.show()

## 4. Machine Learning and Statistics

This final section applies statistical testing and machine learning. We'll perform a t-test to compare two groups and then build a multivariable linear regression model to predict housing values.

### Statistical Analysis: Independent T-test

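A t-test checks whether the difference between the means of two groups is statistically significant. Here, we draw two synthetic samples of 50 values each from normal distributions with means 105 and 100 (standard deviation 10) and compare them with `scipy.stats.ttest_ind`, which by default assumes equal variances in both groups. Because the samples are drawn without a fixed random seed, the statistic and p-value change from run to run.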

In [None]:
# Create two independent samples
group_a = np.random.normal(loc=105, scale=10, size=50)
group_b = np.random.normal(loc=100, scale=10, size=50)

# Perform independent t-test
t_stat, p_value = stats.ttest_ind(a=group_a, b=group_b)

print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("\nThe difference between the groups is statistically significant (p < 0.05).")
else:
    print("\nThe difference between the groups is not statistically significant (p >= 0.05).")
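To show what the t-statistic measures, the sketch below recomputes it by hand from the pooled variance and compares it with SciPy's result. It also runs Welch's variant (`equal_var=False`), which drops the equal-variance assumption. The seed and the variable names are assumptions chosen for this illustration.

```python
import numpy as np
from scipy import stats

# Seeded generator so the example is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(seed=42)
a = rng.normal(loc=105, scale=10, size=50)
b = rng.normal(loc=100, scale=10, size=50)

# Pooled-variance t-statistic: mean difference divided by its standard error
n_a, n_b = len(a), len(b)
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# SciPy's default (equal variances assumed) and Welch's t-test
t_scipy, p_scipy = stats.ttest_ind(a, b)
t_welch, p_welch = stats.ttest_ind(a, b, equal_var=False)

print(f"Manual t: {t_manual:.4f}, SciPy t: {t_scipy:.4f}, p: {p_scipy:.4f}")
print(f"Welch t: {t_welch:.4f}, p: {p_welch:.4f}")
```

With equal sample sizes the two statistics coincide and only the degrees of freedom differ. Welch's version is generally the safer default when the group variances may differ.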

### Multivariable Linear Regression

We'll build a model to predict the median house value (`MedHouseVal`) using multiple features like median income, average rooms, and population. This involves:
1.  Defining features (X) and the target (y).
2.  Splitting the data into training and testing sets.
3.  Training the regression model.
4.  Evaluating its performance.

In [None]:
# 1. Define features (X) and target (y)
features = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup']
X = df_housing[features]
y = df_housing['MedHouseVal']

# 2. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Create and train the model
regression_model = LinearRegression()
regression_model.fit(X_train, y_train)

# 4. Make predictions and evaluate
y_pred = regression_model.predict(X_test)
# Store the score as `r2` so the variable does not share a name with metrics.r2_score
r2 = metrics.r2_score(y_test, y_pred)

print(f"Model R-squared score: {r2:.4f}")

# Displaying the model coefficients
coefficients = pd.DataFrame(regression_model.coef_, index=X.columns, columns=['Coefficient'])
print("\nModel Coefficients:")
display(coefficients)
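R-squared alone says little about the size of the errors in the target's own units. As a hedged sketch of a fuller evaluation, the example below fits the same kind of model on synthetic data from `make_regression` (a stand-in for the housing data, so it runs without a download) and reports MAE and RMSE. It also recomputes R-squared from its definition. All names in the block are assumptions made for illustration.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with six features (not the housing dataset)
X_demo, y_demo = make_regression(
    n_samples=500, n_features=6, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)

# Error metrics expressed in the units of the target
mae = metrics.mean_absolute_error(y_te, pred)
rmse = np.sqrt(metrics.mean_squared_error(y_te, pred))

# R-squared from its definition: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_te - pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE: {mae:.3f}, RMSE: {rmse:.3f}, R-squared: {r2_manual:.4f}")
```

RMSE penalizes large errors more heavily than MAE, so a wide gap between the two hints at a few badly mispredicted points.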