# <font color=#14F278>Unit 4 - Dataset Formatting</font>
---
So far, we have learnt how to build Series and DataFrame objects, and how to select data from them via Indexing, Slicing, Masking and Column Selection. 

In this unit, we will learn how to load datasets from external files, perform simple formatting operations on them, and export the results back to files. We will also touch on the concept of Mutability.


In [2]:
import pandas as pd

---
## <font color=#14F278> 1. Importing Data from CSV and Excel Files:</font>


In practice, we use Pandas for tabular data with more than a couple of columns and rows. Building such a DataFrame manually from scratch is no fun. Luckily, we can simply provide the file path to an existing data file on disk and load it into a Pandas DataFrame.

Steps:
- first, store the <font color=#14F278>**file path**</font> in a `filepath` variable
- we can provide a full (absolute) file path - on Windows, `Shift + right-click` on the file, then select `Copy as Path`
- we can also provide a <font color=#14F278>**relative file path**</font>:
    - if the file is stored in the same folder as our Jupyter Notebook or Python file, we can simply provide the filename as the file path
    - to navigate to parent folders, we can use the `../` structure in the filepath
- once the file path is stored, if the file is in a <font color=#14F278>**CSV**</font> format, use the `pd.read_csv()` function
- if the file is in an <font color=#14F278>**Excel**</font> format, use the `pd.read_excel()` function

In [3]:
# Example of Loading CSV data
# here we're using a relative filepath

filename = r'../data/insurance.csv'
df = pd.read_csv(filename)
display(df)

Unnamed: 0,Client ID,age,bmi,children,smoker,region,CHARGES
0,100,19,27.900,0,yes,southwest,16884.92400
1,101,18,33.770,1,no,southeast,1725.55230
2,102,28,33.000,3,no,southeast,4449.46200
3,103,33,22.705,0,no,northwest,21984.47061
4,104,32,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1341,1441,56,39.820,0,no,southeast,11090.71780
1342,1442,27,42.130,0,yes,southeast,39611.75770
1343,100,19,27.900,0,yes,southwest,16884.92400
1344,101,18,33.770,1,no,southeast,1725.55230


In [31]:
# Example of Loading Excel data
# here we're using a relative filepath

filename = r'../data/insurance.xlsx'
df = pd.read_excel(filename)
display(df)

Unnamed: 0,Client ID,age,bmi,children,smoker,region,CHARGES
0,100,19,27.900,0,yes,southwest,16884.92400
1,101,18,33.770,1,no,southeast,1725.55230
2,102,28,33.000,3,no,southeast,4449.46200
3,103,33,22.705,0,no,northwest,21984.47061
4,104,32,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1341,1441,56,39.820,0,no,southeast,11090.71780
1342,1442,27,42.130,0,yes,southeast,39611.75770
1343,100,19,27.900,0,yes,southwest,16884.92400
1344,101,18,33.770,1,no,southeast,1725.55230
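The loading steps above can also be tried without the course data file. The sketch below is only an illustration: it writes a tiny, hypothetical two-row CSV (`sample.csv`, not part of the course data) into a temporary folder and reads it back. It also assumes you are happy to build the path with `pathlib.Path`, which `pd.read_csv()` accepts in the same way as a plain string.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row sample, written to a temporary folder for illustration only
with tempfile.TemporaryDirectory() as tmp_dir:
    filepath = Path(tmp_dir) / "sample.csv"
    filepath.write_text("Client ID,age,CHARGES\n100,19,16884.924\n101,18,1725.5523\n")

    # read_csv accepts either a string path or a pathlib.Path object
    sample_df = pd.read_csv(filepath)

print(sample_df.shape)
print(list(sample_df.columns))
```

Note that reading `.xlsx` files with `pd.read_excel()` additionally requires an Excel engine such as `openpyxl` to be installed.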


---
## <font color=#14F278> 2. Renaming Columns:</font>
<font color=#14F278>**Column Names Irregularity**</font> is something we often come across when working with raw data. It is best practice to ensure all column names in a dataset follow the same <font color=#14F278>**naming convention**</font>.
- use the `rename(columns = {})` DataFrame method to rename one or multiple columns
    - the keys of the dictionary are the **current column names**
    - the values of the dictionary are the **new names** you want to impose


In [5]:
# Let's rename all relevant columns to ensure all columns are lowercase
# and if there are 2 or more words in a name, they are separated by _

df = df.rename(columns={'Client ID': 'client_id', 'CHARGES':'charges'})

# df.head() returns the first 5 rows from the dataset, allowing us to take a peek at the data
df.head()

Unnamed: 0,client_id,age,bmi,children,smoker,region,charges
0,100,19,27.9,0,yes,southwest,16884.924
1,101,18,33.77,1,no,southeast,1725.5523
2,102,28,33.0,3,no,southeast,4449.462
3,103,33,22.705,0,no,northwest,21984.47061
4,104,32,28.88,0,no,northwest,3866.8552
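When a dataset has many irregular column names, typing out every old/new pair becomes tedious. As a possible alternative (not the approach used in the cell above), `rename()` also accepts a function that is applied to each column name. The sketch below uses a small hypothetical DataFrame called `raw` to apply the same convention: lowercase names, with words separated by `_`.

```python
import pandas as pd

# Small hypothetical dataset with irregular column names
raw = pd.DataFrame({
    "Client ID": [100, 101],
    "age": [19, 18],
    "CHARGES": [16884.924, 1725.5523],
})

# The function is called once per column name and returns the new name
cleaned = raw.rename(columns=lambda name: name.lower().replace(" ", "_"))

print(list(cleaned.columns))
```

This only covers spaces and letter case; names with other irregularities (for example trailing whitespace or special characters) would need extra handling.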


---
## <font color=#14F278> 3. Dropping Columns:</font>
In the previous unit, we discussed **Column Selection** as one of the four main ways of selecting subsets of a DataFrame. 
Column Selection allowed us to keep only the columns relevant to our work. An alternative way of doing this is to <font color=#14F278>**drop any unnecessary columns**</font> from the DataFrame:
- use the `drop(columns = [])` DataFrame method
- store the names of the column (columns) you want to drop in the list argument

In [6]:
df = df.drop(columns = ['children'])
df.head()

Unnamed: 0,client_id,age,bmi,smoker,region,charges
0,100,19,27.9,yes,southwest,16884.924
1,101,18,33.77,no,southeast,1725.5523
2,102,28,33.0,no,southeast,4449.462
3,103,33,22.705,no,northwest,21984.47061
4,104,32,28.88,no,northwest,3866.8552
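To see that dropping and selecting are two routes to the same result, here is a minimal sketch on a small hypothetical DataFrame named `sample`. It compares `drop(columns=[...])` with plain Column Selection of the remaining columns.

```python
import pandas as pd

# Small hypothetical dataset for illustration
sample = pd.DataFrame({
    "age": [19, 18],
    "bmi": [27.9, 33.77],
    "children": [0, 1],
})

dropped = sample.drop(columns=["children"])   # remove what we do not need
selected = sample[["age", "bmi"]]             # keep what we do need

# Both approaches yield DataFrames with identical contents
print(dropped.equals(selected))
```

Dropping tends to be more convenient when only a few columns have to go, while selection is shorter when only a few columns have to stay.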


---
## <font color=#14F278> 4. Setting and Resetting Index:</font>

Sometimes, depending on the use case and the task at hand, we need to format the Index of the DataFrame. By Index, we mean the row labels that identify each observation (row) in the dataset.

There are two main things we can do in Pandas:
- <font color=#14F278>**Set the Index**</font> to be the values from a given column - done via the `set_index()` method
- <font color=#14F278>**Reset the Index**</font> of a DataFrame, which:
    - exports the existing index in the form of a **stand-alone column**
    - resets the row index to the default integer position
    - done via the `reset_index()` method

<center>
    <div>
        <img src="../images/formatting_001.png"/>
    </div>
</center>


In [7]:
# Set the index to be the values under column 'client_id'
df = df.set_index('client_id')
df.head()

client_id,age,bmi,smoker,region,charges
100,19,27.9,yes,southwest,16884.924
101,18,33.77,no,southeast,1725.5523
102,28,33.0,no,southeast,4449.462
103,33,22.705,no,northwest,21984.47061
104,32,28.88,no,northwest,3866.8552


In [8]:
# Reset the index of the dataframe
df = df.reset_index()
df.head()

Unnamed: 0,client_id,age,bmi,smoker,region,charges
0,100,19,27.9,yes,southwest,16884.924
1,101,18,33.77,no,southeast,1725.5523
2,102,28,33.0,no,southeast,4449.462
3,103,33,22.705,no,northwest,21984.47061
4,104,32,28.88,no,northwest,3866.8552
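The sketch below walks through the same two operations on a small hypothetical DataFrame named `sample`. It also shows the `drop=True` option of `reset_index()`, which discards the old index instead of exporting it as a column.

```python
import pandas as pd

# Small hypothetical dataset for illustration
sample = pd.DataFrame({
    "client_id": [100, 101, 102],
    "age": [19, 18, 28],
})

# With client_id as the index, rows can be looked up by client ID
indexed = sample.set_index("client_id")
age_101 = indexed.loc[101, "age"]

# reset_index() exports the index back to a stand-alone column
restored = indexed.reset_index()

# reset_index(drop=True) throws the old index away instead
discarded = indexed.reset_index(drop=True)

print(age_101)
print(list(restored.columns))
print(list(discarded.columns))
```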


---
## <font color=#14F278> 5. Duplicates:</font>

An important step in working with large datasets is assessing whether they contain any <font color=#14F278>**duplicates**</font>. Detecting and handling duplicate observations is crucial for any data analysis and can be used for <font color=#14F278>**assessing data quality and supporting data remediation practices**</font>.

---
### <font color=#14F278> 5.1 Detecting Duplicates:</font>
To detect if your dataset contains any duplicate rows:
- use the `duplicated()` method - it returns a Series of `True/False` values, where `True` indicates a duplicated row
- use `any()` method, chained after `duplicated()` to get a high-level answer - if `True`, the dataset contains duplicates
- store the `True/False` series, produced by `duplicated()` in a **mask** and apply it to your dataframe to retrieve the **duplicated rows**

In [9]:
# Returns a Series object of True and False values
# By default, the first occurrences of the duplications will be flagged as False
df.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
1341    False
1342    False
1343     True
1344     True
1345     True
Length: 1346, dtype: bool

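In [19]:
# To check for duplicates on selected columns only, select them with a list of names (double brackets)
# Note: df['client_id', 'age'] would raise a KeyError, because the tuple is treated as a single column name
df[['client_id', 'age']].duplicated()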

In [10]:
# Chain .any() after the .duplicated() methods to get a True or False value
# if True is returned, the data contains duplicates
df.duplicated().any()

True

In [14]:
df.duplicated(keep = False).any()

True

In [17]:
# Lastly, to retrieve the whole duplicate observations, use Boolean Masking
mask = df.duplicated()
duplicate_df = df[mask]
display(duplicate_df)

Unnamed: 0,client_id,age,bmi,smoker,region,charges
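The `keep` argument controls which occurrences `duplicated()` flags. The sketch below uses a small hypothetical DataFrame named `sample`, in which the first and third rows are identical, to compare the three options.

```python
import pandas as pd

# Small hypothetical dataset: rows 0 and 2 are identical
sample = pd.DataFrame({
    "client_id": [100, 101, 100],
    "age": [19, 18, 19],
})

first = sample.duplicated()                   # default keep='first': later copies are flagged
last = sample.duplicated(keep="last")         # earlier copies are flagged instead
all_copies = sample.duplicated(keep=False)    # every copy is flagged

print(first.tolist())
print(last.tolist())
print(all_copies.tolist())
```

Using `keep=False` in the mask is handy when you want to inspect every copy of a duplicated observation, not just the repeats.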


---
### <font color=#14F278> 5.2 Removing Duplicates:</font>
Once detected, the duplicate rows can be easily removed via the `drop_duplicates()` method:

In [16]:
# Dropping Duplicate Rows
df = df.drop_duplicates()

<font color=#FF8181>**Important:**</font> In all of the above examples, we treated two rows as duplicates only if **all the information in a row** appeared more than once in the dataset. We can also work with **subsets** of columns to both identify and drop duplicates - use the `subset` argument of `duplicated()` and `drop_duplicates()`.
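A minimal sketch of the `subset` argument, assuming a small hypothetical DataFrame named `sample` in which one client ID appears twice with different charges:

```python
import pandas as pd

# Small hypothetical dataset: client 100 appears twice, but with different charges
sample = pd.DataFrame({
    "client_id": [100, 101, 100],
    "age": [19, 18, 19],
    "charges": [16884.92, 1725.55, 17000.00],
})

# Comparing whole rows finds no duplicates, because the charges differ
full_row = sample.duplicated().any()

# Comparing only client_id flags the repeated client
by_id = sample.duplicated(subset=["client_id"])

# Keep the first row for each client_id
deduped = sample.drop_duplicates(subset=["client_id"])

print(full_row)
print(by_id.tolist())
print(len(deduped))
```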

---
## <font color=#14F278> 6. Mutability in Pandas:</font>

Recall the concept of <font color=#14F278>**Mutability**</font>:
- any Python object that supports **in-place changes (mutations) to its values** is called **mutable**
- examples of mutable data types are **lists, dictionaries, sets**, etc.
- any Python object whose value **cannot be changed (mutated) in-place** is called **immutable**
- examples of immutable data types are **integers, strings, tuples**, etc.


In Pandas, <font color=#14F278>**Series and DataFrames**</font> are <font color=#14F278>**mutable**</font>. And it does make sense - it would be wasteful if we had to create a new dataframe object in memory every time we formatted a dataset or changed a value in it.

Let's explore with an example:

In [20]:
filename = r'../data/insurance.csv'
df = pd.read_csv(filename)
display(df)

Unnamed: 0,Client ID,age,bmi,children,smoker,region,CHARGES
0,100,19,27.900,0,yes,southwest,16884.92400
1,101,18,33.770,1,no,southeast,1725.55230
2,102,28,33.000,3,no,southeast,4449.46200
3,103,33,22.705,0,no,northwest,21984.47061
4,104,32,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1341,1441,56,39.820,0,no,southeast,11090.71780
1342,1442,27,42.130,0,yes,southeast,39611.75770
1343,100,19,27.900,0,yes,southwest,16884.92400
1344,101,18,33.770,1,no,southeast,1725.55230


In [21]:
# Check the ID of the object, referenced by variable 'df'
id(df)

2740559555408

All of the methods covered in this unit format the dataset in one way or another. Let's rename columns 'Client ID' and 'CHARGES' and explore what happens to our dataframe object in memory:

In [22]:
# Do you think this operation made an in-place change to the object's value?
df = df.rename(columns={'Client ID': 'client_id', 'CHARGES':'charges'})

In [23]:
# To check, retrieve the ID of the object, referenced by variable 'df' after the operation
# What conclusion can we make?
id(df)

2740547348624
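Another way to run the same check is Python's `is` operator, which tests directly whether two variables refer to the same object. The sketch below uses a small hypothetical DataFrame named `sample` and stores the result of `rename()` under a new name, so that both objects can be compared side by side.

```python
import pandas as pd

# Small hypothetical dataset for illustration
sample = pd.DataFrame({"Client ID": [100], "CHARGES": [16884.924]})

# Without inplace=True, rename() hands back a new DataFrame object
renamed = sample.rename(columns={"Client ID": "client_id", "CHARGES": "charges"})

print(renamed is sample)       # are both names bound to the same object?
print(list(sample.columns))    # the original keeps its old column names
print(list(renamed.columns))
```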

To keep working with the same object instead of getting a new one back every time we format a dataframe, we can use the special `inplace = True` argument. This argument is available for all of the formatting methods in this unit, and many more:
- `rename(columns = {}, inplace = True)`
- `drop(columns = [], inplace = True)`
- `set_index('column_name', inplace = True)`
- `reset_index(inplace = True)`
- `drop_duplicates(inplace = True)`
- etc.

Importantly, when making an <font color=#14F278>**in-place change**</font> via the `inplace=True` argument, we use a <font color=#14F278>**different syntax**</font> - the method is called on its own, without assigning the result back to the variable:

<center>
    <div>
        <img src="../images/formatting_002.png"/>
    </div>
</center>
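The reason for the different syntax can be checked with a short experiment. The sketch below, on a small hypothetical DataFrame named `sample`, captures whatever the in-place call returns. In pandas 1.x and 2.x this is `None`, so writing `df = df.rename(..., inplace = True)` would replace the dataframe with `None`; treat the exact return value in other versions as something to verify for yourself.

```python
import pandas as pd

# Small hypothetical dataset for illustration
sample = pd.DataFrame({"Client ID": [100], "CHARGES": [16884.924]})

# Capture the return value only to inspect it - normally we would not assign it
returned = sample.rename(columns={"CHARGES": "charges"}, inplace=True)

print(returned)                # None in pandas 1.x and 2.x
print(list(sample.columns))    # the change was applied to 'sample' itself
```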


In [24]:
# Data Load
filename = r'../data/insurance.csv'
df = pd.read_csv(filename)
display(df)

# Retrieve object's ID
print(id(df))

Unnamed: 0,Client ID,age,bmi,children,smoker,region,CHARGES
0,100,19,27.900,0,yes,southwest,16884.92400
1,101,18,33.770,1,no,southeast,1725.55230
2,102,28,33.000,3,no,southeast,4449.46200
3,103,33,22.705,0,no,northwest,21984.47061
4,104,32,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1341,1441,56,39.820,0,no,southeast,11090.71780
1342,1442,27,42.130,0,yes,southeast,39611.75770
1343,100,19,27.900,0,yes,southwest,16884.92400
1344,101,18,33.770,1,no,southeast,1725.55230


2740570171408


In [25]:
# Check how the below renaming operation changed the value of the dataframe in-place
df.rename(columns={'Client ID': 'client_id', 'CHARGES':'charges'}, inplace = True)
display(df.head())
print(id(df))

Unnamed: 0,client_id,age,bmi,children,smoker,region,charges
0,100,19,27.9,0,yes,southwest,16884.924
1,101,18,33.77,1,no,southeast,1725.5523
2,102,28,33.0,3,no,southeast,4449.462
3,103,33,22.705,0,no,northwest,21984.47061
4,104,32,28.88,0,no,northwest,3866.8552


2740570171408


In [26]:
# If we perform all steps from this unit with inplace = True:
df.drop(columns = ['children'], inplace=True)
df.set_index('client_id', inplace = True)
df.reset_index(inplace = True)
df.drop_duplicates(inplace = True)
df.head()

Unnamed: 0,client_id,age,bmi,smoker,region,charges
0,100,19,27.9,yes,southwest,16884.924
1,101,18,33.77,no,southeast,1725.5523
2,102,28,33.0,no,southeast,4449.462
3,103,33,22.705,no,northwest,21984.47061
4,104,32,28.88,no,northwest,3866.8552


In [27]:
# Although we performed a number of operations, we actually worked with a single object throughout
# To verify, see how the object's ID before and after the operations is the same
id(df)

2740570171408

---
## <font color=#14F278> 7. Exporting Data to CSV and Excel:</font>

We saw how to import data from external files into a **DataFrame** object, and how to perform simple, yet efficient dataset formatting. The last step of our pipeline is <font color=#14F278>**exporting**</font> the dataframe to a new file:
- to <font color=#14F278>**export to CSV**</font>, use the `to_csv()` method
- to <font color=#14F278>**export to Excel**</font>, use the `to_excel()` method
- if you pass just the name of the new file you want to create, the file will be stored in the same folder as the Jupyter Notebook/Python file you are currently using

In [28]:
df.to_csv('insurance_clean.csv')

In [30]:
df.to_excel('insurance_clean.xlsx')
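By default, `to_csv()` also writes the row index as the first column of the file. When such a file is read back with `read_csv()`, that column shows up under the name `Unnamed: 0`. The sketch below illustrates this with a small hypothetical DataFrame named `sample` and two throwaway files in a temporary folder; passing `index=False` leaves the index out.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small hypothetical dataset for illustration
sample = pd.DataFrame({
    "client_id": [100, 101],
    "charges": [16884.924, 1725.5523],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = Path(tmp_dir) / "with_index.csv"
    without_index = Path(tmp_dir) / "without_index.csv"

    sample.to_csv(with_index)                   # the index is written as an extra column
    sample.to_csv(without_index, index=False)   # the index is left out

    cols_with = list(pd.read_csv(with_index).columns)
    cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```

`to_excel()` accepts the same `index=False` argument.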

---
## <font color=#14F278> 8. Summary:</font>

**Data formatting** and working with **external files** are a crucial part of data analysis in Pandas:
- to import from, or export to a CSV file, use the `read_csv()` function and `to_csv()` method
- to import from, or export to an Excel file, use the `read_excel()` function and `to_excel()` method
- to rename columns, use the `rename(columns = {})` method
- to drop columns, use the `drop(columns = [])` method
- to set or reset index, use the `set_index()` or `reset_index()` methods
- to check for duplicate rows, use the `duplicated()` method
- to drop duplicates, use the `drop_duplicates()` method
- the formatting methods above (`rename()`, `drop()`, `set_index()`, `reset_index()`, `drop_duplicates()`, and many more to come) have an optional `inplace = True` argument, allowing us to make all changes **in-place**!


---
## <font color=#FF8181> 9. Concept Check: </font>
For all questions in this Concept Check, we will be working with the following dataframe:

In [44]:
# Data Load
filename = r'../data/insurance.csv'
df = pd.read_csv(filename)

1. Suppose we create a new variable `df1` and assign `df` to it (`df1 = df`):
- execute `df1 = df1.drop(columns = ['children'])`
- what happened to the ID of the object associated with `df1`?
- what can we conclude about the operation?
- what happened to the value stored in `df` - did it follow suit? Why? Why not?

2. Suppose we create a new variable `df1` and assign `df` to it (`df1 = df`):
- execute `df1.drop(columns = ['children'], inplace = True)`
- what happened to the ID of the object associated with `df1`?
- what can we conclude about the operation?
- what happened to the value stored in `df` - did it follow suit? Why? Why not?

In [39]:
#1
df1 = df
print(id(df))
print(id(df1))
df1 = df1.drop(columns = ['children'])
print(id(df1))


2740612704528
2740612704528
2740613443856


In [45]:
#2
df1 = df

print(id(df1))
df1.drop(columns = ['children'], inplace = True)
print(id(df1))


2740612702672
2740612702672
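The second scenario can be explored further with the sketch below. It uses a small hypothetical DataFrame named `sample` and, as an addition not covered in this unit, the `copy()` method, which creates an independent DataFrame. Because `alias = sample` only binds a second name to the same object, an in-place change made through `alias` is also visible through `sample`, while the copy is unaffected.

```python
import pandas as pd

# Small hypothetical dataset for illustration
sample = pd.DataFrame({"age": [19, 18], "children": [0, 1]})

alias = sample                # a second name for the same object
independent = sample.copy()   # a separate object with the same contents

# The in-place change is made through 'alias'...
alias.drop(columns=["children"], inplace=True)

# ...and is visible through 'sample', but not in the independent copy
print(alias is sample)
print(list(sample.columns))
print(list(independent.columns))
```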
