# **Data Analysis Logbook for ACC ML CS Club projects**

## **Introduction**

My name is Ian Olsen, and this logbook will keep track of all data related to the ML projects for Dr. Mohsin.

## **Initial task**
The first task is to perform an Exploratory Data Analysis (EDA) on a housing dataset. The purpose of this analysis is to gain insight into the data, clean it, handle missing values, and prepare it for further modeling. The work is split into four steps:


1. **Initial Exploration**: Confirming the data types of each feature, checking for missing values, and visualizing the data to understand its distribution.
2. **Data Cleaning**: Handling missing values by either dropping rows and/or columns or imputing them and converting categorical features into usable formats.
3. **Correlation Analysis**: Checking for correlations between features and the target attribute to identify potential relationships.
4. **Data Preparation**: Splitting the dataset into training and test sets for model evaluation.
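
The four steps above can be sketched on a tiny, made-up DataFrame. The column names mirror the housing data, but the values, the choice of median imputation, the 80/20 split ratio, and `random_state=42` are illustrative assumptions rather than decisions recorded in this logbook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8, 2.1, 4.0, 3.1, 6.4, 1.9, 2.7],
    "total_bedrooms": [129.0, 190.0, np.nan, 280.0, 213.0,
                       489.0, np.nan, 426.0, 763.0, 322.0],
    "median_house_value": [452600.0, 352100.0, 341300.0, 342200.0, 269700.0,
                           299200.0, 241400.0, 437100.0, 111300.0, 264000.0],
})

# 1. Initial exploration: data types and missing values per column.
print(toy.dtypes)
missing_counts = toy.isna().sum()
print(missing_counts)

# 2. Data cleaning: fill missing bedrooms with the column median
#    (one option among several; dropping rows is another).
cleaned = toy.copy()
cleaned["total_bedrooms"] = cleaned["total_bedrooms"].fillna(
    cleaned["total_bedrooms"].median()
)

# 3. Correlation analysis: correlation of every column with the target.
target_corr = cleaned.corr()["median_house_value"].sort_values(ascending=False)
print(target_corr)

# 4. Data preparation: hold out 20% of the rows as a test set.
train_set, test_set = train_test_split(cleaned, test_size=0.2, random_state=42)
print(len(train_set), len(test_set))
```

Working on a copy (`cleaned`) keeps the raw frame untouched, which makes it easier to compare different cleaning strategies later.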

## **Dataset Description**

The dataset used in this analysis contains information about housing in California districts. Each row represents one district (a census block group) rather than a single home, which is why several columns are medians or totals. The columns are:

- **longitude**: The longitude of the district's location
- **latitude**: The latitude of the district's location
- **housing_median_age**: The median age of the houses in the district
- **total_rooms**: The total number of rooms in the district
- **total_bedrooms**: The total number of bedrooms in the district
- **population**: The population of the district
- **households**: The number of households in the district
- **median_income**: The median income of the households in the district
- **median_house_value**: The median value of the houses in the district
- **ocean_proximity**: The proximity of the district to the ocean (categorical feature)


## **Objectives**

The main objectives of this analysis are:
- To understand the structure and distribution of the data.
- To identify and handle missing values appropriately.
- To encode categorical variables for model compatibility.
- To explore relationships between features and a target variable.
- To prepare the dataset for training a ML model.
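
As a sketch of the encoding objective, the snippet below one-hot encodes a categorical column with `pd.get_dummies`. The rows are hypothetical; only the column name `ocean_proximity` and its labels come from the dataset. Choosing one-hot encoding here is an assumption, not a decision made in this logbook.

```python
import pandas as pd

# Hypothetical rows; the labels are values that occur in ocean_proximity.
toy = pd.DataFrame({
    "median_income": [8.3, 2.9, 4.5, 5.5],
    "ocean_proximity": ["NEAR BAY", "<1H OCEAN", "INLAND", "NEAR BAY"],
})

# One-hot encode the categorical column: one 0/1 indicator column per label.
encoded = pd.get_dummies(toy, columns=["ocean_proximity"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Scikit-Learn's `OneHotEncoder` is an alternative that fits more naturally into a preprocessing pipeline.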

## **Tools and Libraries**

We will be using the following tools and libraries to perform the analysis:
- **Pandas**: For data manipulation and analysis
- **NumPy**: For numerical operations
- **Matplotlib & Seaborn**: For data visualization
- **Scikit-Learn**: For data preprocessing and model evaluation

## **Getting Started**

To begin the analysis, we first load the necessary libraries and import the datasets (training and test) as needed.



## Python version assertion and Scikit-Learn package import



In [16]:
import sys

assert sys.version_info >= (3, 7)

In [18]:
from packaging import version
import sklearn
assert version.parse(sklearn.__version__) >= version.parse("1.0.1")

## Data importation from GitHub

In [13]:
#import libraries Pandas, pathlib(Path), tarfile, and urllib's request
from pathlib import Path
import pandas as pd
import tarfile
import urllib.request

#Method that handles loading housing data from unzipped tarball
def load_housing_data():
  tarball_path = Path("datasets/housing.tgz")

  #Check that tarball already exists and if not download it
  if not tarball_path.is_file():
    Path("datasets").mkdir(parents=True, exist_ok=True)
    url = "https://github.com/ageron/data/raw/main/housing.tgz"
    print(f"Downloading from {url}...")
    urllib.request.urlretrieve(url, tarball_path)

    #Extract tarball
    try:
      with tarfile.open(tarball_path) as housing_tarball:
        print("Extracting tarball...")
        housing_tarball.extractall(path="datasets")
    except tarfile.TarError as e:
      print(f"Error extracting tarball: {e}")
      return None

  #Load the CSV file
  csv_path = Path("datasets/housing/housing.csv")
  if not csv_path.is_file():
    print(f"CSV file is not located in expected location: {csv_path}")
    return None

  #Return the CSV
  return pd.read_csv(csv_path)

#Create new variable from calling the load_housing_data method
housing = load_housing_data()

if housing is not None:
  print("Housing data has loaded successfully:")
  print(housing.head())

else:
  print("Housing data failed to load.")


Downloading from https://github.com/ageron/data/raw/main/housing.tgz...
Extracting tarball...
Housing data has loaded successfully:
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   

   population  households  median_income  median_house_value ocean_proximity  
0       322.0       126.0         8.3252            452600.0        NEAR BAY  
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY  
2       496.0       177.0         7.2574            352100.0        NEAR BAY  
3       558.0       219.0         5.6431            341300.0        NEAR BAY  
4       565.0       25

## Exploring the data


In [19]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
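
The `info()` output reports 20433 non-null values for `total_bedrooms` out of 20640 rows, so 207 entries are missing in that column. A minimal sketch of counting missing values directly, using a small made-up frame (`toy`) as a stand-in for `housing`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing DataFrame.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# Missing values per column, as a count and as a share of all rows.
missing_count = toy.isna().sum()
missing_share = toy.isna().mean()
print(pd.DataFrame({"missing": missing_count, "share": missing_share}))

# The same arithmetic applied to the figures reported by housing.info().
n_rows, n_non_null = 20640, 20433
n_missing = n_rows - n_non_null
print(n_missing, round(n_missing / n_rows, 4))
```

On the real data, `housing.isna().sum()` gives the per-column counts in one call.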


In [20]:
housing['population'].value_counts()

population
891.0      25
761.0      24
850.0      24
1052.0     24
1227.0     24
           ..
3563.0      1
2878.0      1
10323.0     1
5217.0      1
6912.0      1
Name: count, Length: 3888, dtype: int64

In [21]:
housing['ocean_proximity'].value_counts()

ocean_proximity
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: count, dtype: int64
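
Raw counts are easier to compare as percentages. The sketch below rebuilds the counts shown above as a Series and converts them; on the actual DataFrame, `housing['ocean_proximity'].value_counts(normalize=True)` returns the fractions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series(
    {"<1H OCEAN": 9136, "INLAND": 6551, "NEAR OCEAN": 2658,
     "NEAR BAY": 2290, "ISLAND": 5},
    name="count",
)

# Convert counts to percentages of all rows.
shares = (counts / counts.sum() * 100).round(2)
print(shares)
```

The `ISLAND` category has only 5 rows, which is worth keeping in mind when splitting the data or encoding this feature.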

In [23]:
housing.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |
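
One way to read this summary is to compare each column's mean with its median (the 50% row): a mean well above the median hints at a right-skewed distribution. The sketch below does this with figures copied from the `describe()` output above; the 1.1 threshold is an arbitrary choice made only for illustration.

```python
import pandas as pd

# Mean and median (50%) values copied from the describe() output above.
stats = pd.DataFrame(
    {
        "mean": [28.639486, 2635.763081, 1425.476744, 3.870671],
        "median": [29.0, 2127.0, 1166.0, 3.5348],
    },
    index=["housing_median_age", "total_rooms", "population", "median_income"],
)

# Ratio of mean to median; values clearly above 1 suggest a long right tail.
stats["mean_to_median"] = stats["mean"] / stats["median"]

# 1.1 is an arbitrary cut-off used only for this illustration.
right_skewed = stats.index[stats["mean_to_median"] > 1.1].tolist()
print(stats.round(3))
print(right_skewed)
```

On the real DataFrame, the same two rows can be pulled with `housing.describe().loc[["mean", "50%"]]`. A histogram of each column would be the natural follow-up check.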


In [26]:
housing.sample(13)

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 17226 | -118.02 | 34.12 | 37.0 | 2250.0 | 360.0 | 989.0 | 329.0 | 6.1536 | 366000.0 | INLAND |
| 1995 | -118.08 | 33.91 | 30.0 | 3259.0 | 942.0 | 2744.0 | 895.0 | 2.8608 | 165600.0 | <1H OCEAN |
| 9769 | -118.7 | 34.24 | 28.0 | 2405.0 | 462.0 | 1011.0 | 378.0 | 4.504 | 204300.0 | <1H OCEAN |
| 17982 | -122.31 | 37.54 | 45.0 | 1222.0 | 220.0 | 492.0 | 205.0 | 5.539 | 396900.0 | NEAR OCEAN |
| 206 | -122.14 | 37.41 | 35.0 | 2419.0 | 426.0 | 949.0 | 433.0 | 6.4588 | 437100.0 | NEAR BAY |
| 1697 | -120.43 | 34.69 | 33.0 | 2054.0 | 373.0 | 1067.0 | 358.0 | 3.6023 | 128300.0 | NEAR OCEAN |
| 3955 | -122.03 | 37.53 | 18.0 | 1746.0 | 437.0 | 1268.0 | 404.0 | 3.256 | 183300.0 | NEAR BAY |
| 2683 | -117.08 | 32.73 | 19.0 | 2935.0 | 763.0 | 1953.0 | 720.0 | 1.4254 | 111300.0 | NEAR OCEAN |
| 7312 | -122.43 | 37.73 | 49.0 | 1435.0 | 322.0 | 1008.0 | 329.0 | 4.0 | 264000.0 | NEAR BAY |
| 14677 | -117.57 | 34.15 | 3.0 | 12806.0 | 2219.0 | 4249.0 | 1499.0 | 5.485 | 343100.0 | INLAND |
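
`sample()` draws different rows on every run unless it is seeded. A small sketch, assuming a reproducible sample is wanted; the seed value 42 and the stand-in frame `toy` are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the housing DataFrame.
toy = pd.DataFrame({"row_id": range(100)})

# Passing random_state fixes the seed, so both calls pick the same rows.
first = toy.sample(5, random_state=42)
second = toy.sample(5, random_state=42)
print(first.index.tolist())
print(first.equals(second))
```

The same argument works on the real data, for example `housing.sample(13, random_state=42)`.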
