# Module 7: Real-World Data Analysis Projects


### Section 1: Data Loading and Initial Exploration

**Objective:**  
In this section, we initiate our guided analysis project by loading the California Housing Prices dataset and conducting an initial exploration of its structure.

**Overview:**  
To start our analysis, we need to obtain the dataset and understand its contents. We load it with scikit-learn's `fetch_california_housing` function, passing `as_frame=True` so that the features and target are returned as pandas objects rather than NumPy arrays.

Next, we build a DataFrame named `df` from the feature columns and append the target variable as a `Target` column. Keeping everything in one DataFrame makes the data easy to manipulate and analyze.

To ensure everything is functioning correctly and to get a sense of the dataset's structure, we display the first few rows using `df.head()`. This initial exploration helps us see what features are available and gain an overall understanding of the dataset's format.

By the end of this section, we will have a DataFrame ready for further analysis, setting the foundation for our guided project on California housing prices.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing

# Load the California Housing Prices dataset
data = fetch_california_housing(as_frame=True)

# Create a DataFrame from the dataset
df = pd.DataFrame(data.data, columns=data.feature_names)
df['Target'] = data.target  # Adding the target variable to the DataFrame

# Display the first few rows of the dataset to understand its structure
df.head()
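
Beyond `df.head()`, a few other pandas attributes give a quick structural overview. The sketch below uses a tiny stand-in frame named `toy` with made-up values (an assumption, so that it runs without downloading anything); the same attributes can be read from `df`.

```python
import pandas as pd

# Stand-in for the real DataFrame: three made-up rows and a subset of the columns
toy = pd.DataFrame({
    "MedInc": [8.3, 7.2, 5.6],
    "HouseAge": [41.0, 21.0, 52.0],
    "Target": [4.5, 3.6, 3.4],
})

n_rows, n_cols = toy.shape          # (rows, columns)
column_names = list(toy.columns)    # feature names plus the target
column_types = toy.dtypes           # one dtype per column

print(f"{n_rows} rows x {n_cols} columns")
print(column_names)
print(column_types)
```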

## Glossary of Features in the California Housing Dataset

Let's explore the features present in the California Housing dataset to understand their meaning and relevance:

1. **MedInc (Median Income):** This feature represents the median household income in a census block group, in units of tens of thousands of dollars. It is an important socioeconomic indicator and can influence housing prices.

2. **HouseAge (Housing Median Age):** HouseAge is the median age of houses within a block group, expressed in years. Older houses might be less expensive than newer ones.

3. **AveRooms (Average Rooms):** AveRooms is the average number of rooms per household in a block group. It gives a rough sense of how large the dwellings are.

4. **AveBedrms (Average Bedrooms):** AveBedrms is the average number of bedrooms per household. It gives an idea of the typical bedroom count in the block group.

5. **Population:** Population is the number of people living in the block group. It can affect housing demand and prices.

6. **AveOccup (Average Occupancy):** AveOccup is the average number of household members, that is, the population divided by the number of households. It can provide insights into housing density.

7. **Latitude:** Latitude specifies the geographic latitude of the location, which can be important for regional climate and desirability.

8. **Longitude:** Longitude represents the geographic longitude of the location, providing information about the geographical positioning.

9. **Target (Median House Value):** The Target variable represents the median house value in a specific area (in units of hundreds of thousands of dollars). This is the target variable for regression tasks, and it's what we aim to predict.
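
Because `MedInc` and `Target` are stored in scaled units, it can help to convert them to plain dollars when reading individual rows. The sketch below does that arithmetic on a small frame of made-up values; the names `toy`, `MedInc_usd`, `Target_usd`, and `Households` are illustrative and not part of the dataset.

```python
import pandas as pd

# Made-up block groups, expressed in the dataset's units
toy = pd.DataFrame({
    "MedInc": [8.5, 3.5],           # tens of thousands of dollars
    "Population": [320.0, 1200.0],  # people in the block group
    "AveOccup": [2.5, 3.0],         # people per household
    "Target": [4.5, 1.25],          # hundreds of thousands of dollars
})

# Convert the scaled columns to plain dollars
toy["MedInc_usd"] = toy["MedInc"] * 10_000
toy["Target_usd"] = toy["Target"] * 100_000

# AveOccup = Population / households, so households = Population / AveOccup
toy["Households"] = toy["Population"] / toy["AveOccup"]

print(toy[["MedInc_usd", "Target_usd", "Households"]])
```

A `MedInc` of 8.5 therefore reads as a median income of $85,000, and a `Target` of 4.5 as a median house value of $450,000.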

### Understanding the Dataset:

- **MedInc (Median Income)** and **HouseAge (Housing Median Age)** may be correlated with housing prices. Higher median income areas might have more expensive houses, and newer houses might be pricier.

- **AveRooms (Average Rooms)** and **AveBedrms (Average Bedrooms)** provide information about the size and layout of houses. Areas with larger average rooms or bedrooms might have more spacious properties.

- **Population** and **AveOccup (Average Occupancy)** are related to housing density. High population areas with lower occupancy rates may indicate larger properties or lower demand for housing.

- **Latitude** and **Longitude** offer geographical information that can be used for spatial analysis and understanding regional variations in housing prices.

### Section 2: Data Cleaning and Initial Analysis

**Objective:**  
In this section, we perform data cleaning and conduct an initial analysis of the California Housing Prices dataset.

**Steps:**  

**1. Checking for Missing Values:**  
We begin by checking for missing values, which can distort both analysis and modeling. `df.isnull().sum()` returns the number of missing entries in each column, so a result of all zeros confirms that the data is complete.

**2. Handling Missing Values (if any):**  
Fortunately, in this dataset, no missing values are found. Therefore, there is no need for further data cleaning or imputation.

**3. Summary Statistics:**  
To gain insights into the distribution and central tendencies of the numerical features, we calculate summary statistics using `df.describe()`. This step provides us with key statistical information, including mean, standard deviation, minimum, maximum, and quartiles.

**4. Feature Distribution Visualization:**  
Visualizing the distribution of features is essential to understand the characteristics of the data. We create a grid of histograms, one for each of the eight predictor features, showing the frequency distribution of each. This helps us identify patterns, potential outliers, and the overall shape of the data.


In [None]:
# Check for missing values
if not df.isnull().any().any():
    print("There are no missing values in this dataset")
else:
    # .sum().sum() collapses the per-column counts into one total for the message
    print(f"There are {df.isnull().sum().sum()} missing values in this dataset")

# Data Cleaning: Handling Missing Values (if any)
# No missing values found in this dataset, so no further data cleaning is needed

# Summary statistics for numerical features
summary_stats = df.describe()
print(summary_stats)

# Feature Distribution Visualization
plt.figure(figsize=(16, 10))

# Plot histograms for all numerical features
for i, feature in enumerate(data.feature_names):
    plt.subplot(3, 3, i + 1)
    plt.hist(df[feature], bins=30, edgecolor='k', alpha=0.7)
    plt.title(f'Distribution of {feature}')
    plt.xlabel(feature)
    plt.ylabel('Frequency')

plt.tight_layout()
plt.show()
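
Since this dataset has no missing values, the handling step above is never exercised. As a minimal sketch of what it could look like, the example below counts the gaps in a made-up frame and fills them with the column median. Median imputation is only one common option, chosen here as an assumption.

```python
import numpy as np
import pandas as pd

# Made-up frame with two gaps, since the real dataset has none
toy = pd.DataFrame({
    "MedInc": [8.3, np.nan, 5.6, 3.8],
    "HouseAge": [41.0, 21.0, np.nan, 52.0],
})

missing_per_column = toy.isnull().sum()        # Series: one count per column
total_missing = int(missing_per_column.sum())  # single number for the whole frame

# One simple option (an assumption, not the only choice): fill with the column median
filled = toy.fillna(toy.median())
remaining = int(filled.isnull().sum().sum())

print(missing_per_column)
print(f"Total missing: {total_missing}")
print(f"Remaining after fill: {remaining}")
```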

### Section 3: Exploring Feature Relationships

**Objective:**  
In this section, we explore the relationships between numerical features in the California Housing Prices dataset by creating a correlation matrix and visualizing it as a heatmap.

**Steps:**  

**1. Calculating the Correlation Matrix:**  
Understanding how features are related to each other is crucial in data analysis. To do this, we calculate the correlation matrix using `df.corr()`. This matrix provides correlation coefficients for all pairs of numerical features, showing the strength and direction of their linear relationships.

**2. Creating a Heatmap for Visualization:**  
A heatmap is an effective way to visually represent the correlation matrix. We create a heatmap using the Seaborn library with the `sns.heatmap()` function. It allows us to color-code the correlation values, making it easier to identify strong positive (closer to 1) and negative (closer to -1) correlations. The `annot=True` parameter adds correlation values to the heatmap cells, improving interpretability.

**3. Interpretation:**  
The resulting heatmap shows how the numerical features in the dataset are correlated. With the `coolwarm` colormap, strongly saturated red or blue squares indicate strong positive or negative correlations, while pale squares suggest weak or no linear correlation. This visualization helps identify potential multicollinearity (high correlations between independent variables), which can impact predictive modeling.


In [None]:
# Calculate the correlation matrix for numerical columns
correlation_matrix = df.corr()

# Create a heatmap to visualize correlations
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()
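
By default, `df.corr()` computes the Pearson correlation coefficient. To make that concrete, the sketch below computes the coefficient by hand for one pair of columns, checks it against pandas, and then ranks the features by the strength of their linear association with the target. The data is synthetic (generated here as an assumption), so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data: Target depends on MedInc, HouseAge is pure noise
med_inc = rng.normal(4.0, 1.5, n)
toy = pd.DataFrame({
    "MedInc": med_inc,
    "HouseAge": rng.normal(30.0, 10.0, n),
    "Target": 0.5 * med_inc + rng.normal(0.0, 0.5, n),
})

# Pearson r by hand: co-variation divided by the product of the spreads
x, y = toy["MedInc"], toy["Target"]
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
)

corr = toy.corr()
r_pandas = corr.loc["MedInc", "Target"]

# Rank the features by the absolute correlation with the target
ranking = corr["Target"].drop("Target").abs().sort_values(ascending=False)

print(f"manual r = {r_manual:.4f}, pandas r = {r_pandas:.4f}")
print(ranking)
```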

## Interpreting the Correlation Heatmap

Now that we have generated the correlation heatmap for the California Housing dataset, let's interpret the heatmap and gain insights into the relationships between numerical features.

### Interpretation:

The heatmap provides a visual representation of the correlation coefficients between numerical features. Each square in the heatmap represents the correlation between two features, and the color intensity indicates the strength and direction of the correlation:

- **Positive Correlation:** Features that have a positive correlation show a tendency to move together. If one feature increases, the other tends to increase as well, and vice versa. Positive correlations are represented by dark red squares in the heatmap.

- **Negative Correlation:** Features with a negative correlation move in opposite directions. When one feature increases, the other tends to decrease. Negative correlations are shown as dark blue squares.

- **No Correlation:** Features with little to no correlation appear as pale, near-neutral squares in the heatmap. This indicates that there is no clear linear relationship between the two features; a nonlinear relationship may still exist.
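
One caveat on the "no correlation" case: the coefficient measures only linear association. A minimal sketch with made-up values shows a feature pair where `y` is fully determined by `x`, yet the correlation is essentially zero because the relationship is a symmetric curve rather than a line.

```python
import numpy as np

# Symmetric x values and a perfectly determined, but nonlinear, y
x = np.linspace(-3.0, 3.0, 61)
y = x ** 2

r_linear = np.corrcoef(x, y)[0, 1]              # Pearson r between x and y
r_squared_term = np.corrcoef(x ** 2, y)[0, 1]   # Pearson r between x^2 and y

print(f"corr(x, y)   = {r_linear:.3f}")
print(f"corr(x^2, y) = {r_squared_term:.3f}")
```

The first value is zero up to floating-point error, while the second is 1, which is why a pale square in the heatmap does not rule out a curved relationship.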

### Using the Heatmap:

- Identifying Relationships: Examine the dark red and dark blue squares to identify pairs of features that have strong correlations. These relationships can be crucial when selecting features for modeling.

- Multicollinearity: Be cautious about strong positive or negative correlations between features, as they may indicate multicollinearity. Multicollinearity can affect the stability and interpretability of predictive models.

- Feature Selection: Use the heatmap to guide feature selection by focusing on features that are strongly correlated with the target variable or have meaningful correlations with other features.

- Data Insights: The heatmap can provide valuable insights into the dataset's structure and highlight which features may have predictive power.

Remember that correlation does not imply causation. A strong correlation between two features does not necessarily mean one causes the other; it indicates a statistical relationship. Use these insights to inform your data analysis and modeling decisions.
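
The multicollinearity check described above can also be done programmatically rather than by eye. The following is a sketch under stated assumptions: the data is synthetic, and the `threshold` of 0.8 is an arbitrary cutoff chosen for illustration, not a rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 400

# Synthetic stand-in: AveBedrms is built to track AveRooms closely
ave_rooms = rng.normal(5.0, 1.0, n)
toy = pd.DataFrame({
    "AveRooms": ave_rooms,
    "AveBedrms": 0.2 * ave_rooms + rng.normal(0.0, 0.05, n),
    "HouseAge": rng.normal(30.0, 10.0, n),
})

threshold = 0.8  # assumed cutoff for "strong"; adjust to suit the analysis
corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is skipped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()                 # (feature_a, feature_b) -> r
strong_pairs = pairs[pairs.abs() > threshold]

print(strong_pairs)
```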


### Section 4: Polynomial Regression for Nonlinear Analysis

Polynomial regression is a regression technique used when the relationship between the independent variable(s) and the target variable is nonlinear. In this context, we're applying polynomial regression to the California housing dataset to explore potential nonlinear relationships between median income (MedInc) and median house values (Target).

#### Code Explanation:

- We load the California Housing Prices dataset, similar to our previous exploration.
- We select "Median Income" as our predictor variable (X) and "Median House Value" as our target variable (y).
    - **Predictor Variable** Often called an independent variable or feature, a predictor variable is a data attribute used as input to a statistical or machine learning model to predict or explain changes in the target variable.
- We specify the degree of the polynomial, which determines the complexity of the model. Here, we use a degree of 2, which means we'll fit a quadratic (second-degree) polynomial.
    - **Determining the Polynomial Degree** The degree of the polynomial in polynomial regression is determined by us, the data analysts or scientists. It represents the order of the polynomial equation that will be fitted to the data. For example, a degree of 2 corresponds to a quadratic equation, degree 3 to a cubic equation, and so on.
    - **Choosing the Right Polynomial Degree** Selecting the appropriate polynomial degree is crucial. A lower-degree polynomial may underfit the data, meaning it won't capture complex relationships, while a higher-degree polynomial may overfit the data, fitting noise in the dataset rather than the true underlying pattern. 
    - **In Our Example**, we specified a degree of 2, which means we chose to fit a quadratic (second-degree) polynomial to our data. This allows our model to capture a curved relationship between the predictor variable (e.g., "Median Income") and the target variable (e.g., "House Value"). The choice of degree should be based on a balance between model complexity and its ability to represent the underlying patterns in the data.

- We create a polynomial regression model using `PolynomialFeatures` and `LinearRegression` from Scikit-Learn. This model transforms our predictor variable into polynomial features and fits a linear regression model to it.
    - **Polynomial Regression Model** Polynomial regression is a regression technique used to model relationships between variables that aren't linear. It allows us to capture nonlinear patterns by transforming the predictor variable(s) into polynomial features.
    - **PolynomialFeatures** PolynomialFeatures is a preprocessing step provided by Scikit-Learn. It takes the original predictor variable and expands it into polynomial features up to the chosen degree. For a second-degree polynomial, it produces a constant term, "Median Income," and "Median Income squared"; a "Median Income cubed" term appears only at degree 3 or higher.
    - **LinearRegression** LinearRegression, also from Scikit-Learn, is used in polynomial regression to fit a linear equation to the polynomial features generated by PolynomialFeatures. Despite the polynomial features, LinearRegression applies a linear model in terms of coefficients.
    - **Predictor Variable** The predictor variable, in this context, is the original variable we want to use for making predictions. In our example, it's "Median Income." PolynomialFeatures takes this predictor variable and generates polynomial terms from it.
    - **Fits to a Linear Regression Model** After transforming the predictor variable into polynomial features, we use LinearRegression to fit a linear equation to these polynomial features. This linear equation, with its coefficients, serves as our model. It allows us to make predictions based on the polynomial terms of the original predictor variable, capturing nonlinear relationships within the data.

- After fitting the model, we make predictions and plot the actual house prices (scatter) and the polynomial regression curve (red line).

#### Interpretation:

- Polynomial regression can capture more complex relationships than linear regression. In this case, we aim to capture potential nonlinear trends in the relationship between median income and median house values.
- The resulting curve visually represents the fitted polynomial regression model. In the plot, you can observe how the polynomial curve attempts to capture the nonlinear patterns in the data.
- Polynomial regression allows us to explore whether higher degrees of polynomials (e.g., cubic, quartic) might provide a better fit to the data and reveal more complex relationships.

Using polynomial regression, we can uncover nonlinear insights within the dataset, complementing our initial linear analysis. This technique is valuable when linear models fail to capture the underlying relationships effectively.
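
One way to choose between degrees is to compare how each model scores on data it was not trained on. The sketch below does this with a simple train/test split on synthetic data that is quadratic by construction (an assumption made for illustration); cross-validation would be a reasonable alternative. The degrees tried and the split size are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)

# Synthetic stand-in data with a genuinely quadratic relationship plus noise
X = rng.uniform(-3.0, 3.0, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(0.0, 0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

test_r2 = {}
for degree in (1, 2, 5, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    test_r2[degree] = model.score(X_test, y_test)  # R^2 on held-out data

for degree, score in test_r2.items():
    print(f"degree {degree:>2}: test R^2 = {score:.3f}")
```

On data like this, the straight line underfits badly, and degrees above 2 add complexity without a matching gain on the held-out set.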

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Load the California Housing Prices dataset
data = fetch_california_housing(as_frame=True)

# Create a DataFrame from the dataset
df = pd.DataFrame(data.data, columns=data.feature_names)
df['Target'] = data.target  # Adding the target variable to the DataFrame

# Selecting predictor and target variables
X = df[['MedInc']]
y = df['Target']

# Perform Polynomial Regression
degree = 2  # Degree of the polynomial
polyreg = make_pipeline(PolynomialFeatures(degree), LinearRegression())
polyreg.fit(X, y)

# Predict housing prices
y_pred = polyreg.predict(X)

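# Plot the results
# Sort by income first so the fitted curve is drawn left to right, not as a zigzag
order = np.argsort(X['MedInc'].to_numpy())
plt.scatter(X, y, label='Actual Prices', s=10)
plt.plot(X['MedInc'].to_numpy()[order], y_pred[order], color='red', label='Polynomial Regression (Degree 2)')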
plt.xlabel('Median Income')
plt.ylabel('Median House Value')
plt.legend()
plt.show()

The graph illustrates the relationship between the median income (on the x-axis) and the median house value (on the y-axis) for the California Housing Prices dataset. The red line represents the polynomial regression model (degree 2) that has been fitted to the data.

Here's what you should understand from the graph:

1. **Data Distribution**: The scattered blue points on the graph represent the actual median house values for different levels of median income. These points give you an idea of how the data is distributed across the income and housing value ranges.

2. **Regression Line**: The red curve represents the polynomial regression line, specifically a quadratic (second-degree) polynomial. It shows how the model predicts median house values based on median income. The curve captures the overall trend and relationship between these two variables.

3. **Model Fit**: The curve attempts to fit the data points as closely as possible, and its shape is determined by the degree of the polynomial chosen for regression. In this case, a quadratic polynomial (degree 2) has been used, resulting in a parabolic curve.

4. **Predictive Power**: The polynomial regression model can be used for predictions. Given a median income value, you can use the curve to estimate the corresponding median house value. For example, for a `MedInc` value of 4 (a median income of about $40,000), the model's estimate is the height of the curve at that point, expressed in hundreds of thousands of dollars.

5. **Trend**: You can observe the general trend that as median income increases, median house value tends to rise as well, which is an intuitive relationship. The curve captures this average trend; the vertical spread of the points around it reflects variability that median income alone does not explain.

6. **Residuals**: The differences between the actual data points (blue) and the points on the curve (red) are called residuals. They represent the model's errors or how well it fits the data. Smaller residuals indicate a better fit.
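
The prediction and residual points above can be sketched in code. The example below fits the same kind of degree-2 pipeline to synthetic data (generated here as an assumption, so the numbers are not results for the real dataset), predicts the value at `MedInc = 4`, and summarizes the residuals with the root mean squared error (RMSE).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(7)

# Synthetic stand-in for the (MedInc, Target) columns, in the dataset's units
med_inc = rng.uniform(1.0, 10.0, 200)
toy = pd.DataFrame({
    "MedInc": med_inc,
    "Target": 0.4 * med_inc + rng.normal(0.0, 0.3, 200),
})

X, y = toy[["MedInc"]], toy["Target"]
model = make_pipeline(PolynomialFeatures(2), LinearRegression()).fit(X, y)

# Predict for a block group with MedInc = 4, i.e. about $40,000
query = pd.DataFrame({"MedInc": [4.0]})
pred_units = model.predict(query)[0]
pred_usd = pred_units * 100_000          # Target is in hundreds of thousands

# Residuals: actual minus predicted, summarized by the root mean squared error
residuals = y - model.predict(X)
rmse = float(np.sqrt((residuals ** 2).mean()))

print(f"Predicted value at MedInc=4: about ${pred_usd:,.0f}")
print(f"RMSE: {rmse:.3f} (x $100,000)")
```

A lower RMSE means the points sit closer to the curve on average; it is reported in the same units as `Target`.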

In summary, this graph helps you visualize how a polynomial regression model fits the data and how it can be used to make predictions based on the relationship between median income and median house value. It's a valuable tool for understanding and modeling such relationships in real-world datasets.
