<a href="https://colab.research.google.com/github/Rossel/DataQuest_Courses/blob/master/029__Data_Aggregation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COURSE 4/6: DATA CLEANING AND ANALYSIS

# MISSION 1: Data Aggregation

Learn how to aggregate data with pandas.

## 1. Introduction

So far, we've learned how to use the pandas library and how to create visualizations with data sets that didn't require much cleanup. However, most real-world data sets require extensive cleaning and manipulation before they yield any meaningful insights. In fact, a survey reported by [Forbes](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#71dc8e8a6f63) found that data scientists spend about 60% of their time cleaning and organizing data, so it's critical to be able to manipulate data quickly and efficiently.

In this course, we'll learn the following:

- Data aggregation
- How to combine data
- How to transform data
- How to clean strings with pandas
- How to handle missing and duplicate data

You'll need some basic knowledge of pandas and matplotlib to complete this course, including:

- Basic knowledge of **pandas dataframes and series**
- How to **select values** and **filter a dataframe**
- Knowledge of **data exploration methods** in pandas, such as the `info` and `head` methods
- How to **create visualizations** in pandas and matplotlib

All of these prerequisites are taught in our Pandas and NumPy Fundamentals, Exploratory Data Visualization, and Storytelling Through Data Visualization courses. If you haven't completed those courses and aren't comfortable with the concepts above, we suggest completing them before continuing here.

In this course, we'll work with the **World Happiness Report**, an annual report created by the UN Sustainable Development Solutions Network with the intent of guiding policy. The report assigns each country a happiness score based on the answers to a poll question that asks respondents to rate their life on a scale of 0 to 10.

It also includes estimates of factors that may contribute to each country's happiness, including economic production, social support, life expectancy, freedom, absence of corruption, and generosity, to provide context for the score. Although these factors aren't actually used in the calculation of the happiness score, they can help illustrate why a country received a certain score.

Throughout this course, we'll work to answer the following questions:

- **How can aggregating the data give us more insight into happiness scores?**
- **How did world happiness change from 2015 to 2017?**
- **Which factors contribute the most to the happiness score?**

In this mission, we'll start by learning **how to aggregate data**. Then in the following missions, we'll learn different data cleaning skills that can help us **aggregate and analyze the data in different ways**. We'll start by learning each topic in isolation, but build towards a more **complete data cleaning workflow by the end of the course**.
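As a rough preview of what aggregating means in practice, the sketch below groups a handful of rows by region and computes the mean happiness score for each group. The five rows are copied from the 2015 data previewed in the next section; the names `sample` and `mean_by_region` are illustrative only, and `groupby` is used here ahead of its introduction later in the mission.

```python
import pandas as pd

# Five rows copied from the 2015 preview; a stand-in for the full data set.
sample = pd.DataFrame({
    "Country": ["Switzerland", "Iceland", "Denmark", "Norway", "Canada"],
    "Region": ["Western Europe", "Western Europe", "Western Europe",
               "Western Europe", "North America"],
    "Happiness Score": [7.587, 7.561, 7.527, 7.522, 7.427],
})

# Aggregate: split the rows by region, then reduce each group to its mean score.
mean_by_region = sample.groupby("Region")["Happiness Score"].mean()
print(mean_by_region)
```

The result is a series with one value per region instead of one value per country, which is the kind of summary the rest of this mission builds toward.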



## 2. Introduction to the Data

Let's start by looking at the World Happiness Report for 2015. You can find the data [here](https://www.kaggle.com/unsdsn/world-happiness).

Let's load the data set below:


In [1]:
# Import functions from Google modules into Colaboratory
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [2]:
# Insert the file ID from the Google Drive shareable link:
# https://drive.google.com/file/d/1iZ8_lHkMx7pI22s4ECfpNHKnOohyPfvU/view?usp=sharing
file_id = "1iZ8_lHkMx7pI22s4ECfpNHKnOohyPfvU"

In [3]:
# Download the dataset (file_id avoids shadowing Python's built-in id function)
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('World_Happiness_2015.csv')

In [4]:
# Import the pandas and NumPy libraries
import pandas as pd
import numpy as np

In [5]:
# Read the CSV file into a dataframe
happiness2015 = pd.read_csv("World_Happiness_2015.csv")

### **Exploring the data**
Let's render the first few rows of this pandas object by calling the `head` method on the `happiness2015` variable in a separate cell.

In [6]:
# Render the first 5 rows of the happiness2015 dataframe
happiness2015.head()

| | Country | Region | Happiness Rank | Happiness Score | Standard Error | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Trust (Government Corruption) | Generosity | Dystopia Residual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Switzerland | Western Europe | 1 | 7.587 | 0.03411 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2.51738 |
| 1 | Iceland | Western Europe | 2 | 7.561 | 0.04884 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.4363 | 2.70201 |
| 2 | Denmark | Western Europe | 3 | 7.527 | 0.03328 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2.49204 |
| 3 | Norway | Western Europe | 4 | 7.522 | 0.0388 | 1.459 | 1.33095 | 0.88521 | 0.66973 | 0.36503 | 0.34699 | 2.46531 |
| 4 | Canada | North America | 5 | 7.427 | 0.03553 | 1.32629 | 1.32261 | 0.90563 | 0.63297 | 0.32957 | 0.45811 | 2.45176 |


The output above is a preview of the data set.
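Since the prerequisites mention the `info` and `head` methods, here is a minimal, self-contained sketch of the usual first-look calls. It rebuilds only the five rows and four of the columns from the preview above, so the name `sample` and its contents are an assumed stand-in for the full `happiness2015` dataframe, not the real file.

```python
import io
import pandas as pd

# Five rows copied from the preview above (subset of columns); a stand-in for the full CSV.
csv_text = """Country,Region,Happiness Rank,Happiness Score
Switzerland,Western Europe,1,7.587
Iceland,Western Europe,2,7.561
Denmark,Western Europe,3,7.527
Norway,Western Europe,4,7.522
Canada,North America,5,7.427
"""
sample = pd.read_csv(io.StringIO(csv_text))

first_rows = sample.head(3)    # first 3 rows
last_rows = sample.tail(2)     # last 2 rows
n_rows, n_cols = sample.shape  # (number of rows, number of columns)

print(first_rows)
print(last_rows)
print(n_rows, n_cols)
sample.info()  # prints column names, non-null counts, and dtypes
```

The same calls should work unchanged on `happiness2015` once the CSV has been loaded.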

## 3. Using Loops to Aggregate Data

## 4. The GroupBy Operation

## 5. Creating GroupBy Objects

## 6. Exploring GroupBy Objects

## 7. Common Aggregation Methods with Groupby

## 8. Aggregating Specific Columns with Groupby

## 9. Introduction to the Agg() Method

## 10. Computing Multiple and Custom Aggregations with the Agg() Method

## 11. Aggregation with Pivot Tables

## 12. Aggregating Multiple Columns and Functions with Pivot Tables