# Welcome to Python for Data Science!
Today we'll be exploring the world of Pandas. We will use a timeseries dataset from the UCI Machine Learning Repository to learn how to wrangle data and perform analysis. In the second half of the day you'll do some feature engineering, and finally there will be room to build a small model to actually put your features to use!

<img src="https://miro.medium.com/max/1400/1*1oVjIRY3Bnmbw-idCtg4BQ@2x.jpeg" style="width: 60%;"/>

## In this part we'll cover the following topics:
- Data ingestion
- Data cleaning
- Data exploration (specifically timeseries) 
- Storing your data (you'll need it in the next part). 

## Downloading data from UCI
Use the extraction method that works on your system. If you run into any issues with these steps, please ask for help.

In [None]:
! mkdir -p ../data && curl -L "https://github.com/JelmerOffenberg/datamind-python/blob/master/data/dataset.zip?raw=true" --output ../data/dataset.zip && unzip ../data/dataset.zip -d ../data && mv ../data/household_power_consumption.txt ../data/household_power_consumption.csv

## Getting started
First things first, import the following libraries: `pandas`, `matplotlib` and `numpy`.

In [None]:
# Your solution

---

# Ingestion
The first step in the data science process is data collection. In this section we'll load in some data. Information about the dataset that we'll be loading is located [here](https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption). 

**Exercise:**
* Load the power consumption dataset from the data folder
* Inspect the top 5 rows
* Check the data types of the dataframe and answer the following questions:
    * What's wrong with the data types in the info panel?
    * What data types would you expect for each column?
    * Does it make sense to have date and time columns separated?
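
As a rough illustration of the mechanics (not the full solution), the sketch below reads a tiny made-up sample that mimics the layout of the raw file. It assumes the file is semicolon-separated and that missing readings appear as a placeholder such as `?`; the three rows and the `sample` name are hypothetical.

```python
import io

import pandas as pd

# Hypothetical three-row sample mimicking the layout of the raw file:
# semicolon-separated, day-first dates, "?" where a reading is missing.
raw = io.StringIO(
    "Date;Time;Global_active_power;Voltage\n"
    "16/12/2006;17:24:00;4.216;234.840\n"
    "16/12/2006;17:25:00;?;?\n"
    "16/12/2006;17:26:00;5.374;233.290\n"
)

# For the real file, pass its path (for example
# "../data/household_power_consumption.csv") instead of the StringIO object.
sample = pd.read_csv(raw, sep=";")

print(sample.head())  # top rows (head() shows at most 5 by default)
print(sample.dtypes)  # one data type per column
sample.info()         # dtypes plus non-null counts and memory usage
```

Compare the reported data types with what you would expect for measurements such as power and voltage.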
    

In [None]:
# Your solution

---

# Data cleaning
In this section you're going to do some data cleaning. Since this is a timeseries dataset, you'll need to make it more usable before we can continue.
- Combine the date and time columns into a single datetime column. Verify that your new column is correct and has the right data type: the result should have dtype `datetime64[ns]`.
- Remove the original date and time columns from the dataset.
- Have a look at our variable of interest: `Global_active_power`
    - What data type was assigned by pandas to this variable? Does this make sense?
    - Try casting the variable to the right data type. What do you see?
    - Can you explain this behaviour?
    - Fix the way you load data and rerun your code
    - What change occurred with the data types?
- If you read the dataset description in the previous step, it seems that we can construct a column with the total energy consumption based on the columns that are available. Do this and name this column `power`. 
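
The first steps of this section could be sketched as follows on a small made-up frame. This is one possible approach under stated assumptions: the rows are hypothetical, dates are assumed to be day-first, and `?` is assumed to be the placeholder for a missing reading. The `power` column is deliberately left out.

```python
import pandas as pd

# Hypothetical rows in the same layout as the raw file.
df = pd.DataFrame(
    {
        "Date": ["16/12/2006", "16/12/2006", "17/12/2006"],
        "Time": ["17:24:00", "17:25:00", "00:00:00"],
        "Global_active_power": ["4.216", "?", "5.374"],
    }
)

# Concatenate the two string columns and parse them with an explicit,
# day-first format so 01/02/2007 is read as 1 February, not 2 January.
df["datetime"] = pd.to_datetime(
    df["Date"] + " " + df["Time"], format="%d/%m/%Y %H:%M:%S"
)
df = df.drop(columns=["Date", "Time"])

# A plain cast fails as soon as a non-numeric placeholder is present.
try:
    df["Global_active_power"].astype(float)
except ValueError as err:
    print(f"astype(float) failed: {err}")

# errors="coerce" turns unparseable entries into NaN instead of raising.
df["Global_active_power"] = pd.to_numeric(
    df["Global_active_power"], errors="coerce"
)
print(df.dtypes)
```

An alternative to converting afterwards is to tell `pd.read_csv` which strings mean "missing" through its `na_values` parameter, so the numeric columns are parsed as floats at load time.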

    

In [None]:
# Your solution

## Missing values
Find out if there are weird or missing values in the data. Try to identify the affected rows and columns and answer the following questions.
- Are there any missing values in the dataset?
- If so, which column(s) contain missing values?
- If there are any missing values, extract the rows that contain them and count those rows.
- What would be a good way to deal with the missing values?
- Use your preferred approach to deal with the missing values. Validate that there are no missing values left in your data.
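
A minimal sketch of these checks on a made-up frame is shown below. The values are hypothetical, and dropping versus forward-filling are just two common options, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps; column names follow the dataset.
df = pd.DataFrame(
    {
        "Global_active_power": [4.2, np.nan, 5.4, 3.1],
        "Voltage": [234.8, np.nan, 233.3, np.nan],
    }
)

missing_per_column = df.isna().sum()           # NaN count per column
rows_with_missing = df[df.isna().any(axis=1)]  # rows with at least one NaN
print(missing_per_column)
print(f"{len(rows_with_missing)} rows contain missing values")

# Two common strategies; which one fits depends on the analysis.
dropped = df.dropna()  # discard incomplete rows
filled = df.ffill()    # carry the last valid observation forward

# Validate that nothing is left.
assert not dropped.isna().any().any()
assert not filled.isna().any().any()
```

Dropping rows leaves holes in the timeline, while filling keeps every timestamp but invents values, so the choice affects later resampling.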

In [None]:
# Your solution

---

# Creating a timeseries
In this section we'll combine some data wrangling with data exploration. 
- The data contains a timeseries, but the current index of the dataframe does not show this. Change the index to the newly created datetime column so that you end up with a datetime index.
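
The idea can be sketched on a few made-up rows. The column names `datetime` and `power` are assumptions based on the earlier exercises, and the readings are hypothetical.

```python
import pandas as pd

# Hypothetical minute-level readings; "datetime" and "power" follow the
# column names used in the exercises above.
df = pd.DataFrame(
    {
        "datetime": pd.date_range("2007-01-31 23:58", periods=4, freq="min"),
        "power": [1.0, 2.0, 3.0, 4.0],
    }
)

df = df.set_index("datetime").sort_index()
print(type(df.index))  # a pandas DatetimeIndex

# A datetime index supports selecting by partial date strings.
february = df.loc["2007-02"]
print(february)
```

With the index in place, time-based operations such as `resample` become available directly on the dataframe.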

In [None]:
# Your solution

## Plotting
- Having a dataframe with a timeseries index allows us to do some funky tricks:
    - Plot the `power` column. Is this useful?
    - Create the same plot on the following aggregation levels:
        - Mean per year
        - Mean per 3 months
        - Mean per week
        - Mean per week and median per week
    - Explore the use of `transform` on resampled data. What is the difference with running a normal aggregation function such as `.mean()` on resampled data? Show the differences below.
    - When would you use `transform`?
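
To make the difference between aggregating and transforming concrete, here is a small sketch on a hypothetical 14-day series. Only the weekly alias `"W"` is used, because the aliases for month, quarter and year ends differ between pandas versions (for example `"M"` versus `"ME"`).

```python
import pandas as pd

# Hypothetical daily series covering two full Monday-to-Sunday weeks.
index = pd.date_range("2007-01-01", periods=14, freq="D")
power = pd.Series(range(14), index=index, name="power", dtype="float64")

# Aggregation: one row per week (weekly bins end on Sunday by default).
weekly_mean = power.resample("W").mean()
weekly_stats = power.resample("W").agg(["mean", "median"])

# transform: the weekly mean is broadcast back onto every original row,
# so the result keeps the original 14-row index.
weekly_mean_per_row = power.resample("W").transform("mean")

print(len(weekly_mean), len(weekly_mean_per_row))
# Deviation of each day from its own week's mean
print((power - weekly_mean_per_row).head(7))
```

Calling `.plot()` on `weekly_mean` or `weekly_stats` draws the aggregated lines. The transformed series is mainly useful when you need a group statistic next to each original observation.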

In [None]:
# Your solution

## Creating helper functions
Instead of changing our code every time, we can create a function that returns the plot(s) we need. In addition, creating a function allows you to test your code, which is a good software engineering practice. For more information on functions, see this [link](https://www.tutorialspoint.com/python/python_functions.htm). Write a function that can do the following:
- The function can take in a dataframe, one or more column names, the aggregation level and one or more aggregation metrics (for instance mean or sum). 
- The function should output the graph
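
One possible shape for such a helper is sketched below. The function names `aggregate_columns` and `plot_aggregated` and the demo data are hypothetical. Splitting the aggregation from the plotting is a design choice made here so that the numeric part can be tested without drawing anything.

```python
import pandas as pd


def aggregate_columns(df, columns, level, metrics):
    """Resample `columns` of `df` to `level` and apply `metrics`."""
    # Accept a single name or a list for both columns and metrics.
    columns = [columns] if isinstance(columns, str) else list(columns)
    metrics = [metrics] if isinstance(metrics, str) else list(metrics)
    return df[columns].resample(level).agg(metrics)


def plot_aggregated(df, columns, level, metrics):
    """Plot the aggregated columns and return the matplotlib Axes."""
    aggregated = aggregate_columns(df, columns, level, metrics)
    ax = aggregated.plot(figsize=(12, 4))
    ax.set_title(f"{columns} per {level}")
    return ax


# Hypothetical daily data to exercise the helper
index = pd.date_range("2007-01-01", periods=14, freq="D")
demo = pd.DataFrame({"power": [float(i) for i in range(14)]}, index=index)

weekly = aggregate_columns(demo, "power", "W", ["mean", "median"])
print(weekly)
```

Calling `plot_aggregated(demo, "power", "W", ["mean", "median"])` would then draw one line per column and metric.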

In [None]:
# Your solution

---

# Assignment 1:
Now that you've seen how to wrangle the data and create some plots, it's time to find the answer to the following questions:
- Q1: Find the month with the most power consumption
- Q2: Find the top three weeks with the most power consumption
- Q3: Which week had the largest difference between its highest and lowest power consumption?
- Q4: What was the average power consumption on the weekends?
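
The building blocks for questions like these can be sketched on a hypothetical series. The sketch assumes that "most consumption" means the highest total per period; using the mean instead is equally defensible, so state which one you picked.

```python
import pandas as pd

# Hypothetical daily data: three full Monday-to-Sunday weeks.
index = pd.date_range("2007-01-01", periods=21, freq="D")
power = pd.Series(range(21), index=index, name="power", dtype="float64")

weekly_total = power.resample("W").sum()

# Period with the highest total, and the top two periods
best_week = weekly_total.idxmax()
top_two = weekly_total.nlargest(2)

# Spread between the highest and lowest reading within each week
weekly_spread = power.resample("W").max() - power.resample("W").min()

# Weekend rows: dayofweek counts Monday as 0, so 5 and 6 are Sat and Sun
weekend_mean = power[power.index.dayofweek >= 5].mean()

print(best_week, top_two, weekly_spread, weekend_mean, sep="\n")
```

The same pattern carries over to the real data by swapping the toy series for your `power` column and adjusting the resampling level.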
        

In [None]:
# Your solution

---

# Storage
In this last step, you're going to store your resulting dataset in `parquet` format.
- What does this format do? What is the benefit of using it?
- You'll need the data in the next exercise, so make sure to store it at **day** level.
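
A rough sketch of the downsampling and storage step follows. The helper names `to_daily` and `store_daily` and the demo data are hypothetical, and using the daily mean rather than the daily sum is an assumption; pick whichever fits the next exercise. Parquet is a columnar file format that keeps column data types, which is why a datetime index survives a round trip.

```python
import pandas as pd


def to_daily(df):
    """Downsample to one row per calendar day using the daily mean."""
    return df.resample("D").mean()


def store_daily(df, path):
    """Write the daily frame to parquet; the index is stored as well."""
    to_daily(df).to_parquet(path, engine="fastparquet")


# Hypothetical readings every 12 hours, spanning two days
index = pd.date_range("2007-01-01", periods=4, freq=pd.Timedelta(hours=12))
demo = pd.DataFrame({"power": [1.0, 3.0, 5.0, 7.0]}, index=index)

daily = to_daily(demo)
print(daily)
```

Calling `store_daily(demo, "../data/power_daily.parquet")` (a hypothetical file name) would write the file, and `pd.read_parquet` with the same path reads it back.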

In [None]:
!pip install fastparquet

In [None]:
# Your solution

---