# A Sample Data Analysis and Modeling Project

### 1. Data Loading

First, let's load a sample dataset into a pandas DataFrame. I'll use the `california_housing_train.csv` file available in the Colab environment.

In [None]:
import pandas as pd

# Load the California Housing dataset
df = pd.read_csv('/content/sample_data/california_housing_train.csv')

# Display the first 5 rows of the DataFrame
display(df.head())
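
The path above exists only inside a Colab runtime, so the load fails elsewhere with a bare `FileNotFoundError`. As a minimal sketch, the loader below fails early with a clearer message when the file or an expected column is missing. `load_csv_checked` and `required_columns` are hypothetical names introduced for illustration; they are not part of pandas.

```python
from pathlib import Path

import pandas as pd


def load_csv_checked(path, required_columns=()):
    """Hypothetical helper: read a CSV and verify the expected columns exist."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"CSV file not found: {csv_path}")

    frame = pd.read_csv(csv_path)

    # Collect every expected column that the file does not provide.
    missing = [name for name in required_columns if name not in frame.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return frame
```

In Colab this could be called as `load_csv_checked('/content/sample_data/california_housing_train.csv', ['median_house_value', 'median_income'])`, assuming those two columns are the ones the later steps rely on.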

### 2. Initial Data Exploration

Next, let's get a basic understanding of the dataset, including its structure, data types, and summary statistics.

In [None]:
# Display concise summary of the DataFrame
df.info()

# Display descriptive statistics
display(df.describe())
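
`df.info()` prints its summary rather than returning it, so the result cannot be sorted or filtered. The sketch below gathers similar per-column facts into a DataFrame instead. `column_overview` is a hypothetical helper, and the toy data is made up to stand in for the real dataset.

```python
import pandas as pd


def column_overview(frame):
    """Hypothetical helper: one row per column with dtype, null count, and unique count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isnull().sum(),
        "n_unique": frame.nunique(),  # nunique() ignores missing values by default
    })


# Toy data standing in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "median_income": [1.5, 2.5, None, 2.5],
    "total_rooms": [100, 200, 300, 400],
})

overview = column_overview(toy)
print(overview)
```

Because the result is an ordinary DataFrame, something like `overview.sort_values("n_missing", ascending=False)` would list the columns with the most gaps first.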

### 3. Data Cleaning - Handling Missing Values

Let's check for any missing values in the dataset and decide on a strategy to handle them. For this example, if there are missing values, I'll fill them with the mean of their respective columns.

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values before cleaning:")
display(missing_values[missing_values > 0])

# Fill missing values (if any) with the column mean.
# Only numeric columns are considered, since a mean is undefined for text columns.
for column in df.select_dtypes(include='number').columns:
    if df[column].isnull().any():
        df[column] = df[column].fillna(df[column].mean())

# Re-check for missing values after cleaning
missing_values_after = df.isnull().sum()
print("Missing values after cleaning:")
display(missing_values_after[missing_values_after > 0])
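
Choosing a fill strategy depends on the shape of the column. As an illustration only, the sketch below compares mean and median imputation on a small made-up series with one large outlier. The median fill is an alternative shown for comparison, not the strategy used in this notebook.

```python
import pandas as pd

# Toy column (made-up values): one missing entry and one large outlier.
toy = pd.Series([1.0, 2.0, 3.0, 100.0, None], name="median_income")

# Both statistics skip missing values by default.
mean_value = toy.mean()
median_value = toy.median()

# Fill the gap with each candidate value to compare the results.
filled_with_mean = toy.fillna(mean_value)
filled_with_median = toy.fillna(median_value)

print(f"mean fill:   {mean_value}")
print(f"median fill: {median_value}")
```

The outlier pulls the mean far above most of the observed values, while the median stays close to them. For heavily skewed columns that can make the median a more representative fill value.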

### 4. Data Visualization - Exploring Distributions

Visualizing the distribution of key features reveals their range, skew, and potential outliers. Let's look at the distributions of `median_house_value` and `median_income`.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set the style for the plots
sns.set_style("whitegrid")

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1) # 1 row, 2 columns, first plot
sns.histplot(df['median_house_value'], kde=True)
plt.title('Distribution of Median House Value')
plt.xlabel('Median House Value')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2) # 1 row, 2 columns, second plot
sns.histplot(df['median_income'], kde=True)
plt.title('Distribution of Median Income')
plt.xlabel('Median Income')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()
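
A histogram's shape can also be summarized numerically. The sketch below computes sample skewness and per-bin counts on made-up toy data. The bin count of 5 is an arbitrary choice for illustration and does not match seaborn's automatic binning.

```python
import numpy as np
import pandas as pd

# Toy right-skewed data (made-up values) standing in for a column such as median_income.
toy = pd.DataFrame({"median_income": [1.0, 1.5, 2.0, 2.5, 3.0, 15.0]})

# Sample skewness: a positive value suggests a longer right tail,
# a negative value a longer left tail.
skewness = toy["median_income"].skew()

# Counts per equal-width bin, comparable to the bar heights of a histogram.
counts, bin_edges = np.histogram(toy["median_income"], bins=5)

print(f"skewness: {skewness:.2f}")
print("counts per bin:", counts.tolist())
```

On the real data, the same two calls could be applied to `df['median_house_value']` and `df['median_income']` to back up what the plots suggest visually.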