<a href="https://colab.research.google.com/github/Silviatulli/DataScience_PublicPolicy/blob/main/Data_Cleaning_Manipulation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

title: Data Cleaning and Manipulation\
summary: importing data, importing libraries, creating a dataframe, handling missing data, filtering data, exploratory data analysis, pandas, numpy, matplotlib and seaborn libraries\
main reference: [Pandas documentation](https://pandas.pydata.org/docs/getting_started/index.html), [NumPy documentation](https://numpy.org/devdocs/), [Matplotlib documentation](https://matplotlib.org/stable/index.html), [Seaborn Documentation](https://seaborn.pydata.org/)

scope: Data Science for Public Policy Course\
last update: 2024-03-12

# **Data Cleaning and Manipulation**

**Table of Contents**
* Import data
  * Mount Google Drive
  * Upload data files directly
* Import packages and libraries
* Data Cleaning and Manipulation: Pandas library
  * Access your data
  * Display the First Few Rows
  * Data Types and Missing Values
  * Summary Statistics
  * Drop Columns
  * Isolate variables
  * Group dataset with respect to a variable
* Data Visualization: Matplotlib and Seaborn libraries
  * Histogram
  * Box Plot
  * Scatter Plot
  * Bar Chart
  * Heatmap

**On your own**


# **Import data**

## Mount Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Upload data files directly
In the left-hand sidebar of the notebook, click the "Files" (folder) icon to open the Files panel. Click the "Upload" button and select the files you want to upload from your local machine. Files uploaded this way live in the temporary session storage and are removed when the runtime is reset.

# **Import packages and libraries**
In Python, a **package** is a **collection of modules (or sub-packages)** grouped together so that they can be distributed and installed as a unit. **Library** is a looser term for a **collection of reusable code and routines** that provides specific functionality; a library such as pandas is typically distributed as one or more packages.

To start working with pandas, or any other Python package, **import the package**. The community-agreed alias for pandas is `pd`, and the pandas documentation assumes it has been imported that way.

In [2]:
import pandas as pd
# Because we imported pandas as "pd", we refer to it as "pd" from now on,
# for instance when we call one of the library's functions.
# We could also write "import pandas as PANDAROUX", but typing "PANDAROUX"
# every time we use it would take longer.

In [3]:
import numpy as np

In [4]:
import matplotlib.pyplot as plt

## Access your data with pandas

In [6]:
data = pd.read_csv('/content/drive/MyDrive/house_prices.csv')

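If the file is not at that location, pandas raises an error such as:

FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/house_prices.csv'

Check that Google Drive is mounted and that `house_prices.csv` has been uploaded to `MyDrive`, then run the cell again.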
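
One way to avoid the traceback is to check that the path exists before reading it. The sketch below is only an illustration: `load_csv_if_present` is a hypothetical helper, and a temporary file with made-up values stands in for `house_prices.csv`.

```python
import os
import tempfile

import pandas as pd


def load_csv_if_present(path):
    """Return a DataFrame if `path` exists, otherwise None (hypothetical helper)."""
    if not os.path.exists(path):
        print(f"File not found: {path}")
        return None
    return pd.read_csv(path)


# A temporary file with made-up values stands in for house_prices.csv.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "house_prices.csv")
    pd.DataFrame({"LotArea": [8450, 9600]}).to_csv(demo_path, index=False)

    found = load_csv_if_present(demo_path)
    missing = load_csv_if_present(os.path.join(tmp_dir, "no_such_file.csv"))
```

In Colab you would pass the Drive path instead, for example `load_csv_if_present('/content/drive/MyDrive/house_prices.csv')`.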

You can also import your data from a URL pointing to a CSV file:

In [None]:
# URL pointing to the CSV file
url = 'http://example.com/data.csv'

# Read data from URL into DataFrame
df = pd.read_csv(url)

## Display the First Few Rows
View the first few rows of your dataset to understand the data structure and variable names:

In [None]:
data.head()
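
`head()` returns the first five rows by default. As a small sketch on a made-up DataFrame (the values below are invented and are not taken from the course dataset), you can also ask for a different number of rows, look at the end of the table with `tail()`, and read the table dimensions from `shape`:

```python
import pandas as pd

# Toy data with invented values, reusing two column names from this notebook.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
    "SaleType": ["WD", "WD", "New", "WD", "COD", "WD"],
})

first_three = toy.head(3)    # first 3 rows instead of the default 5
last_two = toy.tail(2)       # last 2 rows
n_rows, n_cols = toy.shape   # (number of rows, number of columns)

print(first_three)
print(last_two)
print(n_rows, n_cols)
```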

## Data Types and Missing Values
Check the data types of each column and identify missing values:

In [None]:
data.dtypes

In [None]:
data.isnull().sum()
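
Once you know where the missing values are, two common options are to drop incomplete rows or to fill the gaps. The following is a minimal sketch on made-up data; filling with the column median is only one possible choice and may not suit every variable.

```python
import numpy as np
import pandas as pd

# Toy data with invented values and a few missing entries.
toy = pd.DataFrame({
    "LotArea": [8450.0, np.nan, 11250.0, 9550.0],
    "GarageArea": [548.0, 460.0, np.nan, np.nan],
})

# Count missing values per column.
missing_counts = toy.isnull().sum()

# Option 1: keep only the rows that have no missing values.
complete_rows = toy.dropna()

# Option 2: replace each missing value with the median of its column.
filled = toy.fillna(toy.median())

print(missing_counts)
print(complete_rows)
print(filled)
```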

## Summary Statistics
Calculate summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for the numeric columns to get a sense of their central tendency and spread:

In [None]:
data.describe()
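
On a DataFrame that mixes numeric and text columns, `describe()` summarises only the numeric ones by default. As a sketch on made-up data, calling `describe()` on a single text column reports its count, number of unique values, most frequent value and that value's frequency:

```python
import pandas as pd

# Toy data with invented values.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
    "SaleType": ["WD", "WD", "New", "WD", "COD", "WD"],
})

# Numeric summary: only LotArea appears.
numeric_summary = toy.describe()

# Summary of a text column: count, unique, top, freq.
categorical_summary = toy["SaleType"].describe()

print(numeric_summary)
print(categorical_summary)
```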

For a categorical variable, you can get the unique values and their counts:

In [None]:
# General pattern: data['categorical_column'].value_counts()
data['SaleType'].value_counts()
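
The table of contents also lists dropping columns, isolating variables and grouping the dataset by a variable. A minimal sketch of those operations on made-up data follows; the `Id` column and all values are assumptions for illustration only.

```python
import pandas as pd

# Toy data with invented values; "Id" is a hypothetical column.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "LotArea": [8450, 9600, 11250, 9550],
    "GarageArea": [548, 460, 608, 642],
    "SaleType": ["WD", "WD", "New", "New"],
})

# Drop a column that is not needed for the analysis.
without_id = toy.drop(columns=["Id"])

# Isolate variables: keep only the selected columns.
areas = toy[["LotArea", "GarageArea"]]

# Filter rows with a boolean condition.
large_lots = toy[toy["LotArea"] > 9500]

# Group by a categorical variable and average a numeric one.
mean_lot_by_type = toy.groupby("SaleType")["LotArea"].mean()

print(without_id)
print(areas)
print(large_lots)
print(mean_lot_by_type)
```

None of these calls modifies `toy`; each returns a new object, which is why the results are assigned to new names.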

# **Data Visualization**
Now, let's explore various types of data visualizations to uncover patterns and relationships in your data.

## Histogram
Visualize the distribution of a numerical variable using a histogram:

In [None]:
# General pattern: plt.hist(data['numeric_column'], bins=20)
plt.hist(data['LotArea'], bins=20)
plt.xlabel('LotArea')
plt.ylabel('Frequency')
plt.title('Histogram of LotArea')
plt.show()
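
A histogram splits the range of the variable into equal-width bins and counts how many values fall into each one. To see the numbers behind the bars, the same binning can be sketched with `np.histogram`, here on made-up values:

```python
import numpy as np

# Made-up values, not taken from the course dataset.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# Split the range [1, 8] into 4 equal-width bins and count values per bin.
counts, bin_edges = np.histogram(values, bins=4)

print(counts)
print(bin_edges)
```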

## Box Plot
Create a box plot to understand the spread and central tendency of numerical data:

In [None]:
plt.boxplot(data['LotArea'])
# The box is drawn vertically by default, so the values are on the y-axis
plt.ylabel('LotArea')
plt.title('Box Plot of LotArea')
plt.show()
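
By default, Matplotlib draws the whiskers out to the last data point within 1.5 times the interquartile range (IQR) of the box, and plots anything beyond that as an individual point. As a sketch on made-up values, the same rule can be applied in pandas to list those points:

```python
import pandas as pd

# Made-up values; the last one is deliberately extreme.
lot_area = pd.Series([8450, 9600, 11250, 9550, 14260, 215245], name="LotArea")

q1 = lot_area.quantile(0.25)   # first quartile
q3 = lot_area.quantile(0.75)   # third quartile
iqr = q3 - q1                  # interquartile range

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside [lower, upper] would be drawn as individual points.
outliers = lot_area[(lot_area < lower) | (lot_area > upper)]

print(outliers)
```

Whether such points should be removed, transformed or kept is a judgment call that depends on the analysis.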

## Scatter Plot
Investigate the relationship between two numerical variables:

In [None]:
# General pattern: plt.scatter(data['numeric_column1'], data['numeric_column2'])
plt.scatter(data['LotArea'], data['GarageArea'])
plt.xlabel('LotArea')
plt.ylabel('GarageArea')
plt.title('Scatter Plot of GarageArea vs LotArea')
plt.show()
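
A scatter plot shows the relationship visually; a single number that summarises how linear it is can be obtained with the Pearson correlation coefficient. A minimal sketch on made-up values:

```python
import pandas as pd

# Toy data with invented values that increase together.
toy = pd.DataFrame({
    "LotArea": [8000, 9000, 10000, 11000],
    "GarageArea": [400, 450, 520, 600],
})

# Pearson correlation between two columns (between -1 and 1).
r = toy["LotArea"].corr(toy["GarageArea"])

print(round(r, 3))
```

A value close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or no linear relationship.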

## Bar Chart
Visualize the distribution of a categorical variable:

In [None]:
# General pattern: data['categorical_column'].value_counts().plot(kind='bar')
data['SaleType'].value_counts().plot(kind='bar')
plt.xlabel('SaleType')
plt.ylabel('Count')
plt.title('Bar Chart of SaleType')
plt.show()

## Heatmap (Correlation)
Explore correlations between numerical variables with a heatmap:

In [None]:
#let's import seaborn
import seaborn as sns

In [None]:
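# numeric_only=True (pandas 1.5+) skips text columns such as SaleType,
# which would otherwise raise an error in recent pandas versions
correlation_matrix = data.corr(numeric_only=True)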

plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
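
Correlations are only defined for numeric columns, so text columns have to be left out before the matrix is computed. One way to do that, sketched here on made-up data, is to select the numeric columns explicitly with `select_dtypes`:

```python
import pandas as pd

# Toy data with invented values: two numeric columns and one text column.
toy = pd.DataFrame({
    "LotArea": [8000, 9000, 10000, 11000],
    "GarageArea": [400, 450, 520, 600],
    "SaleType": ["WD", "WD", "New", "COD"],
})

# Keep only the numeric columns, then compute pairwise correlations.
numeric_columns = toy.select_dtypes(include="number")
toy_correlations = numeric_columns.corr()

print(toy_correlations)
```

With many numeric columns, annotating every cell makes the heatmap hard to read; passing a subset of columns, or a larger `figsize`, usually helps.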