# Chapter 1 Dataframes and Datasets

## Introduction to Data Wrangling

**"Data wrangling is the process of transforming and structuring data from its raw form into a desired format with the intent of improving data quality and making it more consumable and useful for analytics or machine learning. It is also sometimes called data munging."**

In Python, data wrangling is most commonly done with the pandas library.

**To install the pandas library, use the following command:**

In command prompt: pip install pandas

In a notebook environment: !pip install pandas

In [None]:
# Install pandas
!pip install pandas

## Pandas

**"Pandas (commonly imported as pd) is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language."**

Pandas operates on tabular datasets, so the first step in any analysis is to load a dataset into it.

**Official documentation for pandas:** https://pandas.pydata.org/docs/reference/index.html

## Datasets

Textual datasets can be stored in many different formats, including:

1. CSV (Comma-Separated Values)
2. JSON (JavaScript Object Notation)
3. SQL (Structured Query Language) Relations
4. And many others

By far, the most commonly used format is CSV.

### Comma-Separated Values

As the name suggests, the values are separated by commas. The dataset is divided into rows and columns, and the first row usually holds the column names (the header).

Ex:

        "Pokedex Number", "Name", "Type"
        1, "Bulbasaur", "Grass"
        2, "Ivysaur", "Grass"
        4, "Charmander", "Fire"
        7, "Squirtle", "Water"

More readable format:

| Pokedex Number | Name        | Type    |
|----------------|-------------|---------|
| 1              | Bulbasaur   | Grass   |
| 2              | Ivysaur     | Grass   |
| 4              | Charmander  | Fire    |
| 7              | Squirtle    | Water   |
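The sample above can be loaded straight from a string, which is handy for experimenting without a file on disk. This is a minimal sketch: the variable names `csv_text` and `df_pokemon` are illustrative, and `skipinitialspace=True` is passed because the sample has a blank after every comma.

```python
import io

import pandas as pd

# In-memory copy of the sample CSV shown above (illustrative, not a real file)
csv_text = '''"Pokedex Number", "Name", "Type"
1, "Bulbasaur", "Grass"
2, "Ivysaur", "Grass"
4, "Charmander", "Fire"
7, "Squirtle", "Water"
'''

# skipinitialspace=True ignores the blank that follows each comma,
# so the quoted values are parsed as plain strings
df_pokemon = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

print(df_pokemon)
```

The first row becomes the column labels, and every following row becomes one record.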

**Note that pandas is not limited to commas: when reading a CSV file, the `sep` parameter of read_csv() lets the user specify any other delimiter, such as a semicolon or a tab.**
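As a small sketch of reading a file with a different delimiter, the snippet below parses semicolon-separated text through the `sep` parameter of `read_csv`. The data and the name `df_semicolon` are made up for illustration.

```python
import io

import pandas as pd

# Illustrative semicolon-separated data held in memory
text = "id;name;type\n1;Bulbasaur;Grass\n4;Charmander;Fire\n"

# sep tells read_csv which character separates the values
df_semicolon = pd.read_csv(io.StringIO(text), sep=";")

print(df_semicolon)
```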

**A popular source of datasets, Kaggle:** https://www.kaggle.com/datasets

## Dataframes

The central data structure in pandas is the Dataframe (a dataframe variable is commonly named df). To work with a dataset in any format, it first needs to be loaded into a Dataframe.

**"A Dataframe is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes."**
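To make the definition concrete, here is a minimal sketch that builds a tiny dataframe by hand. The column names and values are invented for illustration only.

```python
import pandas as pd

# Heterogeneous: one column holds text, the other holds integers
df_demo = pd.DataFrame(
    {"name": ["Bulbasaur", "Charmander"], "pokedex_number": [1, 4]}
)

# Size-mutable: a new column can be added after creation
df_demo["type"] = ["Grass", "Fire"]

# Labeled axes: rows are labeled by the index, columns by their names
print(df_demo.index)
print(df_demo.columns)
print(df_demo.dtypes)
```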

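To read a dataset and convert it into a Dataframe, we use one of the pandas reader functions, which are named after the format they read: **read_csv()**, **read_json()**, **read_sql()**, and so on.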
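As a hedged illustration of a reader other than the CSV one, the sketch below loads a small JSON document with `pd.read_json`. The JSON text and the name `df_json` are hypothetical.

```python
import io

import pandas as pd

# Illustrative JSON: a list of records, one object per row
json_text = """
[
  {"pokedex_number": 1, "name": "Bulbasaur", "type": "Grass"},
  {"pokedex_number": 4, "name": "Charmander", "type": "Fire"}
]
"""

# read_json turns each object into a row and each key into a column
df_json = pd.read_json(io.StringIO(json_text))

print(df_json)
```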

## Input a .csv dataset into Pandas

To read a .csv file, we use the pandas.read_csv() function. It loads the contents of the .csv file into a dataframe, on which pandas can then perform data analysis or manipulation.

Syntax: pandas.read_csv('Path to file')

Example:

In [None]:
# Import pandas
import pandas as pd

# Path to the file
path = "./datasets/sales.csv"

# Read the csv file into a dataframe and store it in a variable
df_sales = pd.read_csv(path)

# Print the dataframe
df_sales

In [None]:
# Try with a much bigger dataset "kc_house_data.csv" in the "datasets" folder.

df_houses = pd.read_csv("./datasets/kc_house_data.csv")

df_houses

## Simple properties/methods defined on DataFrames

**Official documentation for all the various methods defined on dataframes:** https://pandas.pydata.org/docs/reference/frame.html

1) **Dataframe.columns:** Returns the column/attribute labels of the dataframe.

Ex: df_sales.columns

In [None]:
# Columns in sales dataframe
df_sales.columns

In [None]:
# Columns in houses dataframe
df_houses.columns

2) **len(Dataframe):** Returns the no. of rows/records/tuples in the dataframe.

Ex: len(df_sales)

In [None]:
# Return the no. of rows in sales dataframe
len(df_sales)

In [None]:
# Return the no. of rows in houses dataframe
len(df_houses)

3) **Dataframe.shape:** Returns the no. of rows and cols in the dataframe.
    
Ex: df_sales.shape

In [None]:
# Return the shape of sales dataframe
df_sales.shape

# The output (7, 3) indicates that the dataframe has 7 rows and 3 columns

In [None]:
# Return the shape of houses dataframe
df_houses.shape

4) **Dataframe.size:** Returns the total number of elements (i.e., rows * cols) in the dataframe.
    
Ex: df_sales.size

In [None]:
# Return the size of sales dataframe
df_sales.size

In [None]:
# Return the size of houses dataframe
df_houses.size
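The four properties above are related, which a small self-contained sketch can show. The dataframe `df_small` and its values are invented for illustration.

```python
import pandas as pd

# Illustrative dataframe with 3 rows and 2 columns
df_small = pd.DataFrame(
    {"item": ["pen", "book", "bag"], "price": [10, 250, 900]}
)

# shape is a (rows, columns) tuple
n_rows, n_cols = df_small.shape

print(list(df_small.columns))  # column labels
print(len(df_small))           # number of rows, same as n_rows
print(df_small.size)           # total elements, same as n_rows * n_cols
```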

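## Display options of pandas and subsetting dataframes by number of rows

1) **pd.options.display.min_rows:** Number of rows shown when the display of a dataframe is truncated. Truncation happens only when the dataframe has more rows than pd.options.display.max_rows (60 by default).

Ex: pd.options.display.min_rows = 10

In [None]:
# Show 10 rows whenever the output is truncated (10 is also the pandas default)
pd.options.display.min_rows = 10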

In [None]:
# Now try printing sales dataframe
df_sales

In [None]:
# Now try printing houses dataframe
df_houses
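The interplay between `display.max_rows` and `display.min_rows` can be tried out without changing the global settings by using `pd.option_context`. This is only a sketch; the dataframe `df_long` and the chosen limits are arbitrary.

```python
import pandas as pd

# Illustrative dataframe with 100 rows
df_long = pd.DataFrame({"n": range(100)})

# Inside the block: truncate any dataframe longer than 20 rows,
# and show only 6 rows (3 from the top, 3 from the bottom) when truncating
with pd.option_context("display.max_rows", 20, "display.min_rows", 6):
    text = repr(df_long)

print(text)
```

The previous option values are restored automatically when the `with` block ends.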

2) **Dataframe.head(count):** Returns the first 'count' no. of rows of the dataframe. By default the value of count is 5.

Ex: df_sales.head()

In [None]:
# Print the first five rows in sales dataframe
df_sales.head()

In [None]:
# Print the first five rows in houses dataframe
df_houses.head()

In [None]:
# Store the first 15 rows in houses dataframe into a new dataframe called 'df_15houses'
df_15houses = df_houses.head(15)

df_15houses

3) **Dataframe.tail(count):** Returns the last 'count' no. of rows of the dataframe. By default the value of count is 5.

Ex: df_sales.tail()

In [None]:
# Print the last five rows in sales dataframe
df_sales.tail()

In [None]:
# Print the last five rows in houses dataframe
df_houses.tail()

In [None]:
# Store the last 15 rows in houses dataframe into a new dataframe called 'df_15lasthouses'
df_15lasthouses = df_houses.tail(15)

df_15lasthouses
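A quick sketch on a small, made-up dataframe shows how `head` and `tail` behave, including a negative count, which returns everything except the given number of rows from the other end.

```python
import pandas as pd

# Illustrative dataframe with the numbers 1 to 10
df_nums = pd.DataFrame({"n": range(1, 11)})

first_three = df_nums.head(3)        # rows holding 1, 2, 3
last_two = df_nums.tail(2)           # rows holding 9, 10
all_but_last_two = df_nums.head(-2)  # rows holding 1 to 8

print(first_three)
print(last_two)
print(all_but_last_two)
```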

## Datatypes of columns in Dataframes

By default, while converting a dataset into a dataframe, pandas inspects the values in each column and assigns an appropriate datatype to it.

1) **Dataframe.dtypes:** Returns the datatype assigned to each column.

Ex: df_sales.dtypes

In [None]:
# Information about datatypes in sales dataframe
df_sales.dtypes

In [None]:
# Information about datatypes in houses dataframe
df_houses.dtypes

2) **Dataframe.info():** Prints a summary of the dataframe: the datatype of each column, the no. of non-null records in each column, and the total memory occupied by the dataframe.

Ex: df_sales.info()

In [None]:
# Print sales dataset
df_sales.head()

In [None]:
# Information about datatypes, no. of non-null records in sales dataframe
df_sales.info()

In [None]:
# Print houses dataset
df_houses.head()

In [None]:
# Information about datatypes, no. of non-null records in houses dataframe
df_houses.info()
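The effect of missing values on the summary is easier to see on a tiny example. The sketch below uses an invented dataframe, `df_mixed`, with one missing value in two of its columns.

```python
import pandas as pd

# Illustrative dataframe with text, float, and integer columns
df_mixed = pd.DataFrame(
    {
        "city": ["Seattle", None, "Kent"],
        "price": [221900.0, 538000.0, None],
        "bedrooms": [3, 3, 2],
    }
)

# Datatype inferred for each column
print(df_mixed.dtypes)

# Datatypes, non-null counts per column, and memory usage
df_mixed.info()

# The non-null counts reported by info() match count()
non_null = df_mixed.count()
print(non_null)
```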

## Basic Dataframe Analysis Methods

| Method   | Description                                           |
|:---------|:------------------------------------------------------|
| sum      | Returns the sum of the values in each column.         |
| min      | Returns the minimum value of each column.             |
| max      | Returns the maximum value of each column.             |
| count    | Returns the count of non-null values in each column.  |
| mean     | Returns the mean of the values in each column.        |
| median   | Returns the median of the values in each column.      |
| mode     | Returns the most frequent value(s) of each column.    |
| describe | Returns a DataFrame with statistical information like mean, standard deviation, minimum, maximum, and quartiles. |
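The methods in the table all work column by column and skip missing values by default. A minimal sketch on an invented dataframe, `df_scores`, illustrates this:

```python
import pandas as pd

# Illustrative scores; one math score is missing
df_scores = pd.DataFrame(
    {"math": [70, 85, 85, None], "physics": [60, 75, 90, 80]}
)

totals = df_scores.sum()       # per-column sum, missing values skipped
counts = df_scores.count()     # per-column count of non-null values
medians = df_scores.median()   # per-column median
modes = df_scores.mode()       # one row per mode; a column can have several

print(totals)
print(counts)
print(medians)
print(modes)
```

Note that `mode` returns a dataframe rather than a single row of values, because a column may have more than one most frequent value.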

In [None]:
# Print houses dataset
df_houses.head()

In [None]:
# Find the sum of all the columns in houses dataset
df_houses.sum()

# Note: Performs string concatenation on string columns (including dates stored as text)

In [None]:
# Find the sum of all the numeric columns in houses dataset
df_houses.sum(numeric_only=True)

In [None]:
# Find the min among all the columns in houses dataset
df_houses.min()

In [None]:
# Find the max among all the columns in houses dataset
df_houses.max()

In [None]:
# Find the count of all the columns (NA values excluded) in houses dataset
df_houses.count()

In [None]:
# Print sales dataset
df_sales.head()

In [None]:
# Find the count of all the columns (NA values excluded) in sales dataset
df_sales.count()

In [None]:
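# Find the mean of all the columns in houses dataset
# Left commented out: with pandas 2.0 or later this raises a TypeError
# when the dataframe contains string columns (see the note below)
# df_houses.mean()

#### Important Note: The mean and median can be computed only for numerical data.

Columns that contain non-numeric data are known as nuisance columns, and they need to be excluded before calculating these statistics. Versions of pandas before 2.0 dropped such columns automatically. Since pandas 2.0, numeric_only defaults to False, so calling mean() or median() on a dataframe with string columns raises a TypeError. Either pass numeric_only=True or select the numeric columns first. The mode, by contrast, can be computed for columns of any datatype.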

Ex: 

df_numeric_sales = df_sales.select_dtypes(include="number")

df_numeric_sales.mean()
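Both ways of excluding non-numeric columns give the same result, as the following sketch suggests. The dataframe `df_products` and its values are made up for illustration.

```python
import pandas as pd

# Illustrative dataframe with one text column and two numeric columns
df_products = pd.DataFrame(
    {"product": ["pen", "book"], "price": [10.0, 250.0], "units": [3, 1]}
)

# Option 1: ask mean() to ignore non-numeric columns
means = df_products.mean(numeric_only=True)

# Option 2: keep only the numeric columns, then take the mean
means_alt = df_products.select_dtypes(include="number").mean()

print(means)
print(means_alt)
```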

In [None]:
# Extract numeric columns from dataset
df_numeric_houses = df_houses.select_dtypes(include="number")

# Calculate the mean
df_numeric_houses.mean()

In [None]:
# Find the median of all the numeric columns in houses dataset
df_numeric_houses.median()

In [None]:
# Find the mode of all the numeric columns in houses dataset
df_numeric_houses.mode()

In [None]:
# Describe the statistical information of houses dataset - by default only numeric columns are summarized
df_houses.describe()
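If a summary of the non-numeric columns is wanted as well, `describe` accepts an `include` parameter. The sketch below uses an invented dataframe, `df_items`, to compare the default with `include="all"`.

```python
import pandas as pd

# Illustrative dataframe with one text column and one numeric column
df_items = pd.DataFrame(
    {"product": ["pen", "book", "pen"], "price": [10.0, 250.0, 12.0]}
)

# Default: only the numeric column is summarized
numeric_summary = df_items.describe()

# include="all": text columns are summarized too (count, unique, top, freq)
full_summary = df_items.describe(include="all")

print(numeric_summary)
print(full_summary)
```

Statistics that do not apply to a column, such as the mean of a text column, show up as NaN in the combined summary.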