# Business Understanding

This project analyzes house sales data from King County so that real estate agencies can advise homeowners on how a renovation might increase the estimated value of their homes. *add the recommendations*

## Understanding the Data


The Column Names file helps us understand the King County data set a lot better, including what each column represents.


In [1]:
#import libraries
import pandas as pd

# Reset the float_format option to its default value
pd.options.display.float_format = None

In [2]:
#Importing the data
df = pd.read_csv('kc_house_data.csv')

FileNotFoundError: [Errno 2] No such file or directory: './data/kc_house_data.csv'

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.head(20)

In [None]:
df.describe()

In [None]:
# Visualize data
df.hist(figsize = (15,15), edgecolor = 'black');

Viewing the data shows the different data types we are working with. From this, it is clear that the data needs to be cleaned.

# Data Cleaning

Here I will clean the data to make it more useful.
I will start by finding the missing values in the dataset. To do this, I will write a function that returns the number of missing values in each column and the percentage they represent.

In [None]:
# The function below summarizes the missing values of the dataset.
def missing_values(data):
    # count the missing values per column, largest first
    miss = data.isnull().sum().sort_values(ascending=False)

    # share of missing values per column, as a fraction of all rows (0.17 means 17%)
    percentage_miss = (data.isnull().sum() / len(data)).sort_values(ascending=False)

    # store both in a dataframe
    missing = pd.DataFrame({"Missing Values": miss, "Percentage": percentage_miss}).reset_index()

    # drop the columns that have no missing values
    missing.drop(missing[missing["Percentage"] == 0].index, inplace=True)

    return missing

In [None]:
# I will now call the function and pass my dataset through it to get my desired output.
missing_values(df)

The columns with missing values are 'yr_renovated', 'waterfront' and 'view', with 17%, 11% and 0.2% of their values missing, respectively.
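As a small illustration of how such figures arise, the sketch below computes the missing count and percentage per column on a made-up DataFrame. The `toy` frame, its values, and the `Percent Missing` column name are hypothetical and are not taken from the King County data.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the calculation
toy = pd.DataFrame({
    "view": ["NONE", None, "GOOD", "NONE"],
    "waterfront": ["NO", "NO", None, None],
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
})

# isna().sum() counts missing cells per column;
# isna().mean() gives the fraction of rows that are missing
summary = pd.DataFrame({
    "Missing Values": toy.isna().sum(),
    "Percent Missing": toy.isna().mean() * 100,
})

# Keep only the columns that actually have missing values, largest first
summary = summary[summary["Missing Values"] > 0].sort_values(
    "Missing Values", ascending=False
)
print(summary)
```

Here `price` drops out of the summary because it has no missing entries.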

Because these columns matter for our analysis, we will fill their missing values rather than drop them. We will fill the missing values using the mode (the most frequent value) of each column.

I will write a function to help me get this done.

In [None]:
# write a function to take the data and column name and fill the missing values with the mode
def fill_values(data, column_name):
    # get the mode (most frequent value) of the column
    most_frequent_value = data[column_name].mode()[0]

    # assign the filled column back; calling fillna(inplace=True) on a
    # selected column may not update the DataFrame in recent pandas versions
    data[column_name] = data[column_name].fillna(most_frequent_value)

    return data
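To see what filling with the mode does, here is a minimal sketch on a made-up Series (the values are hypothetical). `mode()` returns a Series because ties are possible, which is why the first entry is taken.

```python
import pandas as pd

# Hypothetical column with two missing entries
view = pd.Series(["NONE", "GOOD", None, "NONE", None], name="view")

# mode() ignores missing values and returns the most frequent value(s)
most_frequent = view.mode()[0]

# fillna returns a new Series with the gaps replaced by the mode
filled = view.fillna(most_frequent)
print(filled.tolist())
```

One consequence of this choice is that every gap receives the same value, so the most common category becomes even more dominant after filling.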

In [None]:
# fill view column using the function
df = fill_values(df, 'view')
missing_values(df)

In [None]:
#  fill yr_renovated column using the function
df = fill_values(df, 'yr_renovated')
missing_values(df)

In [None]:
#  fill waterfront column using the function
df = fill_values(df, 'waterfront')
missing_values(df)

We have now filled all the missing values. We can confirm this by checking the whole dataset using .info().

In [None]:
df.info()

We have enough confirmation that the data does not have any missing values. 

We can now move to changing the data types of various columns in order to have a seamless analysis.

## Fix structural issues

### Fix the datatypes

In [None]:
# Convert 'id' to object
df['id'] = df['id'].astype('object')

# Convert 'date' to datetime
df['date'] = pd.to_datetime(df['date'])

# Convert 'waterfront', 'view', 'condition', and 'grade' to categorical
df['waterfront'] = df['waterfront'].astype('category')
df['view'] = df['view'].astype('category')
df['condition'] = df['condition'].astype('category')
df['grade'] = df['grade'].astype('category')

# Convert 'sqft_basement' to numeric (handle non-numeric values appropriately)
df['sqft_basement'] = pd.to_numeric(df['sqft_basement'], errors='coerce')

# Convert 'yr_renovated' to integer
df['yr_renovated'] = df['yr_renovated'].astype('Int64')

Using this we have changed all the datatypes of the columns needed to the appropriate datatypes. We can check the altered datatypes using .info()

In [None]:
df.info()

We now note that the datatypes are changed. We can now proceed.

Notice that when 'sqft_basement' was converted to numeric, the values that could not be converted were turned into NaN, while the numeric values were kept intact.

This means that we will have to work on the missing values.
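The effect of `errors='coerce'` can be seen on a small made-up Series. The values below are hypothetical, and the `"n/a"` entry simply stands in for any string that cannot be parsed as a number.

```python
import pandas as pd

# Hypothetical raw values stored as strings
raw = pd.Series(["0.0", "400.0", "n/a", "910.0"], name="sqft_basement")

# Unparseable entries become NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")

print(converted.tolist())
print("missing after conversion:", converted.isna().sum())
```

Counting the NaN values right after the conversion shows how many entries were not numeric to begin with.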

In [None]:
# I will first check the number and percentage of the missing values
missing_values(df)

The percentage of the missing values is 2%. I will drop the rows with missing values.


In [None]:
# Drop rows with missing values in the 'sqft_basement' column
df.dropna(subset=['sqft_basement'], inplace=True)

# The 'inplace=True' argument modifies the original DataFrame 'df' in place

In [None]:
# check the dataframe again
df.info()

## Unique values 

Here I will check the unique characters in every column of the dataframe so that any unnecessary characters can be removed.

In [None]:

def unique_characters_summary(data):
    # Collect one row per column; DataFrame.append was removed in pandas 2.0
    rows = []

    # Loop through each column in the DataFrame
    for column_name in data.columns:
        # Get the unique characters that appear in the column's values
        unique_chars = set("".join(data[column_name].astype(str)))

        # Store the result, sorted so the output is stable between runs
        rows.append({'Column': column_name, 'Unique Characters': "".join(sorted(unique_chars))})

    # Build the summary DataFrame in one step
    return pd.DataFrame(rows, columns=['Column', 'Unique Characters'])


In [None]:
unique_characters_summary(df)

Looking at the output, we notice that the characters in every column are consistent with that column's values and datatype.
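For intuition about what this check catches, the sketch below applies the same idea to a made-up frame. The `toy` data is hypothetical, and the `"?"` entry is only a stand-in for a stray character.

```python
import pandas as pd

# Hypothetical data: one clean numeric column, one column with a stray "?"
toy = pd.DataFrame({
    "bedrooms": [3, 2, 4],
    "sqft_basement": ["0.0", "400.0", "?"],
})

# For each column, join all values as text and keep the distinct characters
chars = {
    column: "".join(sorted(set("".join(toy[column].astype(str)))))
    for column in toy.columns
}
print(chars)
```

A numeric column should show only digits, a decimal point, or a sign, so any other character stands out immediately.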

We can now move to check the outliers of the dataset.

### Look for outliers

I will plot box plots of the numeric columns of my dataset in order to get a visual of how the data is distributed.

In [None]:
# import the necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Define a list of column names for which you want to create box plots
columns_to_plot = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'sqft_above']

# Set up the figure and axis for plotting
plt.figure(figsize=(12, 8))

# Loop through the selected columns and create box plots
for position, column in enumerate(columns_to_plot, start=1):
    plt.subplot(2, 3, position)  # Create a subplot for each column
    sns.boxplot(data=df[[column]])
    plt.title(f'Boxplot of {column}')
    plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

Since we have more numeric columns we will repeat the plotting and make our comments.


In [None]:
# Define a list of column names for which you want to create box plots
columns_to_plot = ['price', 'sqft_basement', 'yr_built', 'yr_renovated', 'sqft_living15', 'sqft_lot15']

# Set up the figure and axis for plotting
plt.figure(figsize=(12, 8))

# Loop through the selected columns and create box plots
for position, column in enumerate(columns_to_plot, start=1):
    plt.subplot(2, 3, position)  # Create a subplot for each column
    sns.boxplot(data=df[[column]])
    plt.title(f'Boxplot of {column}')
    plt.xticks(rotation=45)

plt.tight_layout()
plt.show()
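By default, the box plots above mark points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. The same rule can be applied numerically to count them. The sketch below is one way to do that; the helper name `count_iqr_outliers` and the sample values are hypothetical, and the quartiles pandas computes may differ slightly from those used by the plotting library.

```python
import pandas as pd


def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())


# Hypothetical values with one obviously extreme entry
bedrooms = pd.Series([2, 3, 3, 3, 4, 4, 5, 33])
print(count_iqr_outliers(bedrooms))
```

Applied to the real data, a call such as `df[columns_to_plot].apply(count_iqr_outliers)` would give one count per plotted column, which makes it easier to compare columns than reading the plots alone.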