## Goal
Use your data science knowledge to explore the data provided and create Linear Regression models to predict how many bikes will be rented based on historical information.
## Course Learning Outcomes (CLO) Assessed
- CLO #1 Explain common models and processing pipelines in machine learning applications
- CLO #2 Apply machine learning algorithms to design solutions for real problems
- CLO #4 Analyse results and solutions to verify their correctness and impact on decision making
## Assessment Criteria and Rubric
This assessment is about data exploration and reasoning over the modelling outcome.
You will likely face overfitting and will have to iterate between data engineering and modelling until you get satisfactory results.
It is taken for granted that the code must be readable and have comments indicating what you are doing. Poor readability will cost points in grading.


In [3]:
import pandas as pd

In [6]:
df = pd.read_csv('Energy_Consumption_Data.csv')
print(df)

In [7]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 16 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   House_ID                       100000 non-null  int64  
 1   Month                          99994 non-null   object 
 2   Area_sq_ft                     95000 non-null   float64
 3   Occupants                      100000 non-null  object 
 4   Heating_Type                   64737 non-null   object 
 5   Age_of_Building                100000 non-null  int64  
 6   Insulation_Quality             100000 non-null  object 
 7   Daily_Average_Consumption_kWh  100000 non-null  float64
 8   Season                         100000 non-null  object 
 9   Energy_Efficiency_Rating       100000 non-null  int64  
 10  Tariff_Type                    100000 non-null  object 
 11  Bill_Amount                    95000 non-null   float64
 12  Renewable_Energy_Installed     

Unnamed: 0,House_ID,Month,Area_sq_ft,Occupants,Heating_Type,Age_of_Building,Insulation_Quality,Daily_Average_Consumption_kWh,Season,Energy_Efficiency_Rating,Tariff_Type,Bill_Amount,Renewable_Energy_Installed,Temperature_Average,Power_Outages,Monthly_Consumption_kWh
0,7271,Jan,662.0,5,,27,Good,14.335539,Spring,2,Time-of-Use,88.741986,Yes,55.467771,1,501.23278
1,861,Oct,1253.0,2,,13,cverage,25.917839,Spring,3,Variable,92.35494,Yes,36.388473,1,742.352878
2,5391,Dec,,3,Gas,22,Poor,33.127843,Summer,6,Fixed,96.708638,No,94.550243,0,1018.708608
3,5192,Aug,1535.0,1,,16,Good,7.451494,Autumn,6,Variable,95.568663,No,62.862087,2,199.420939
4,5735,Sep,1336.0,5,,13,Excellent,34.649571,Winter,10,Variable,,No,93.559405,3,933.15137


In [35]:
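# Useful variables
columnNames = ['House_ID', 'Month', 'Area_sq_ft', 'Occupants', 'Heating_Type',
    'Age_of_Building', 'Insulation_Quality',
    'Daily_Average_Consumption_kWh', 'Season', 'Energy_Efficiency_Rating',
    'Tariff_Type', 'Bill_Amount', 'Renewable_Energy_Installed',
    'Temperature_Average', 'Power_Outages', 'Monthly_Consumption_kWh'
    ]

def printColumnValues(frame, columnTargets):
    """Print the unique values of each requested column."""
    for column in columnTargets:
        if column not in frame.columns:
            # Skip columns that were already dropped instead of raising a KeyError
            print(f'{column}: not in dataframe')
            continue
        print(f'{column}: {frame[column].unique()}')

# Understand what is in the columns
# Note: an earlier run raised KeyError: 'House_ID', most likely because this
# cell (In [35]) was executed after the cell that drops House_ID (In [33])
printColumnValues(df, columnNames)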

### Cleaning Data To-Do
After using the function above to see the unique values in each column, I have summarized how to approach the `data cleaning`:
- `House_ID` - Drop
- `Month` - Drop
- `Area_sq_ft` - Check that values are floats
- `Occupants` - Convert 'five' to 5
- `Heating_Type` - Check that each value is a valid word
- `Age_of_Building` - Remove `--`
- `Insulation_Quality` - Check that each value is a valid word
- `Daily_Average_Consumption_kWh` - No **negatives**, float, no **Null**
- `Season` - Use this to sort
- `Energy_Efficiency_Rating` - 1-10, not too sure how useful
- `Tariff_Type` - Badly corrupted; check that each value is a valid word or words
- `Bill_Amount` - No **negatives**, float, no **Null**
- `Renewable_Energy_Installed` - Kind of useless
- `Temperature_Average` - Not really that useful
- `Power_Outages` - Needs to be accounted for
- `Monthly_Consumption_kWh` - Can't figure out if it's derived from daily consumption
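The open question about `Monthly_Consumption_kWh` can be probed with a quick ratio check. The sketch below uses only the five rows shown by `df.head()`, so it is an indication rather than a conclusion. It assumes that a derived monthly figure would be close to the daily average multiplied by the number of days in the month (roughly 28 to 31). The names `sample`, `ratio` and `correlation` are introduced here for illustration.

```python
import pandas as pd

# The five rows shown by df.head(), restricted to the two columns of interest
sample = pd.DataFrame({
    'Daily_Average_Consumption_kWh': [14.335539, 25.917839, 33.127843, 7.451494, 34.649571],
    'Monthly_Consumption_kWh': [501.23278, 742.352878, 1018.708608, 199.420939, 933.15137],
})

# If monthly were simply daily * days_in_month, this ratio would sit near 28-31
ratio = sample['Monthly_Consumption_kWh'] / sample['Daily_Average_Consumption_kWh']
print(ratio.round(2).tolist())

# Correlation shows how strongly the two columns move together
correlation = sample['Daily_Average_Consumption_kWh'].corr(sample['Monthly_Consumption_kWh'])
print(round(correlation, 3))
```

For these five rows the ratio ranges from roughly 27 to 35, so the monthly value does not look like a fixed multiple of the daily average, although the two columns are strongly correlated. Running the same check on the full dataframe would be needed before deciding how to treat the column.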

In [33]:
# Drop columns that are not useful
# We do not need to know which house ID is consuming the energy,
# and season is a better general indicator for prediction than month
df.drop(columns=['House_ID', 'Month'], inplace=True)
df.head()  # Check that the columns were dropped


Unnamed: 0,Area_sq_ft,Occupants,Heating_Type,Age_of_Building,Insulation_Quality,Daily_Average_Consumption_kWh,Season,Energy_Efficiency_Rating,Tariff_Type,Bill_Amount,Renewable_Energy_Installed,Temperature_Average,Power_Outages,Monthly_Consumption_kWh
0,662.0,5,,27,Good,14.335539,Spring,2,Time-of-Use,88.741986,Yes,55.467771,1,501.23278
1,1253.0,2,,13,cverage,25.917839,Spring,3,Variable,92.35494,Yes,36.388473,1,742.352878
2,,3,Gas,22,Poor,33.127843,Summer,6,Fixed,96.708638,No,94.550243,0,1018.708608
3,1535.0,1,,16,Good,7.451494,Autumn,6,Variable,95.568663,No,62.862087,2,199.420939
4,1336.0,5,,13,Excellent,34.649571,Winter,10,Variable,,No,93.559405,3,933.15137


After dropping both `House_ID` and `Month`, the dataframe is ready to be cleaned.
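Two items from the cleaning to-do list, converting `'five'` to 5 in `Occupants` and removing negative or null values from `Bill_Amount`, could be sketched as follows. This is a minimal sketch on a toy dataframe with made-up values; `WORD_TO_NUMBER` and `clean_occupants` are hypothetical helpers, and the lookup would need extending after inspecting the real column.

```python
import pandas as pd

# Toy frame with the kinds of problems listed in the to-do (values are illustrative)
toy = pd.DataFrame({
    'Occupants': ['5', 'five', '2', '3'],
    'Bill_Amount': [88.74, 92.35, -10.0, None],
})

# Hypothetical lookup for spelled-out numbers; extend it as needed
WORD_TO_NUMBER = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}

def clean_occupants(series):
    """Map spelled-out numbers to digits, then convert the column to integers."""
    mapped = series.map(lambda v: WORD_TO_NUMBER.get(str(v).strip().lower(), v))
    # Anything still unparseable becomes <NA> rather than raising
    return pd.to_numeric(mapped, errors='coerce').astype('Int64')

toy['Occupants'] = clean_occupants(toy['Occupants'])

# Keep only rows whose bill is present and non-negative
# (NaN >= 0 evaluates to False, so missing bills are dropped by the same mask)
cleaned = toy[toy['Bill_Amount'] >= 0]
print(cleaned)
```

Using the nullable `Int64` dtype keeps the column integer-typed even if some entries cannot be parsed.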

In [None]:
df = df.dropna(subset=['Area_sq_ft'])  # Drop rows with NaN in 'Area_sq_ft' column
df = df[df['Area_sq_ft'].apply(lambda x: isinstance(x, float) and x > 0)]  # Keep rows where 'Area_sq_ft' is a float and > 0

In [None]:
# Convert to a numeric dtype; unparseable values become NaN
df['Area_sq_ft'] = pd.to_numeric(df['Area_sq_ft'], errors='coerce')
df.dropna(subset=['Area_sq_ft'], inplace = True)
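To see how numeric coercion behaves on messy input, here is a small sketch on a standalone series. The values are invented for illustration (the real `Area_sq_ft` column is already reported as `float64` by `df.info()`), and the names `area`, `area_numeric` and `area_clean` are not part of the notebook.

```python
import pandas as pd

# Illustrative values only: a valid number, a numeric string, junk text, a missing entry
area = pd.Series([662.0, '1253', 'n/a', None], name='Area_sq_ft')

# errors='coerce' turns anything unparseable into NaN instead of raising
area_numeric = pd.to_numeric(area, errors='coerce')

# Drop missing entries and keep strictly positive areas
area_clean = area_numeric.dropna()
area_clean = area_clean[area_clean > 0]
print(area_clean.tolist())
```

The same pattern should apply to `Bill_Amount` and `Daily_Average_Consumption_kWh` if they turn out to contain non-numeric entries.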

### TODO
- Set expected datatype of each column
- Delete  
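Setting expected datatypes could be approached with a mapping from column name to dtype. The sketch below is one possible shape for that step: the dtype choices in `EXPECTED_DTYPES` are assumptions rather than decisions taken in this notebook, and `apply_schema` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical schema: the dtype choices are assumptions, not taken from the notebook
EXPECTED_DTYPES = {
    'Area_sq_ft': 'float64',
    'Occupants': 'Int64',
    'Season': 'category',
    'Power_Outages': 'Int64',
}

def apply_schema(frame, schema):
    """Cast each listed column that is present in the frame to its expected dtype."""
    present = {col: dtype for col, dtype in schema.items() if col in frame.columns}
    return frame.astype(present)

# Toy frame with illustrative values
toy = pd.DataFrame({
    'Area_sq_ft': [662, 1253],
    'Occupants': [5, 2],
    'Season': ['Spring', 'Summer'],
})
typed = apply_schema(toy, EXPECTED_DTYPES)
print(typed.dtypes)
```

Casting should come after the value-level cleaning, since `astype` raises on entries it cannot convert.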