### Data Wrangling
Data wrangling, also known as data munging, is the process of gathering and transforming raw data into another format so that it is easier to understand, access, and analyze, and so that decisions can be made in less time.

Data wrangling in Python typically involves the following steps:
1. Data exploration: The data is studied, analyzed, and understood, often with the help of visualizations.
2. Dealing with missing values: Most large datasets contain missing values (NaN). These need to be handled, either by replacing them with the column's mean or mode (the most frequent value) or by dropping the rows that contain them.
3. Reshaping data: The data is manipulated to fit the requirements; new data can be added or existing data can be modified.
4. Filtering data: Datasets sometimes contain unwanted rows or columns, which need to be removed or filtered out.
5. Other: After the raw dataset has been processed with the steps above, the result is a dataset that fits our requirements and can be used for purposes such as data analysis, machine learning, data visualization, or model training.
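
The missing-value options from step 2 can be sketched on a small frame. The `toy` DataFrame and its column names are invented for illustration and are not part of the Titanic data used below; this is a minimal sketch, not the only way to impute.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "port": ["S", "C", None, "S"],
})

# Numeric column: replace NaN with the column mean
age_filled = toy["age"].fillna(toy["age"].mean())

# Categorical column: replace missing values with the mode (most frequent value)
port_filled = toy["port"].fillna(toy["port"].mode()[0])

# Alternative: drop every row that contains at least one missing value
complete_rows = toy.dropna()
```

Which option is appropriate depends on the column and on how much data is missing; dropping rows is simplest but discards information.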

In [1]:
# Import libraries
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')

In [3]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [5]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [6]:
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
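
The counts above show that `Cabin` is missing in 687 of 891 rows (roughly 77%). One possible way to act on such counts is to look at the share of missing values per column and drop columns above a threshold. The sketch below uses a small invented frame, and the 0.5 threshold is an arbitrary assumption, not a rule.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# Share of missing values per column (the mean of a boolean mask is a fraction)
missing_share = toy.isna().mean()

# Columns where more than half of the values are missing (threshold is an assumption)
mostly_missing = missing_share[missing_share > 0.5].index.tolist()

# Drop those columns and keep the rest
reduced = toy.drop(columns=mostly_missing)
```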

#### Data Transformation

In [7]:
# Filtering and subsetting let you select specific rows or columns based on conditions

# Select passengers who survived
survived_passenger = df[df['Survived'] == 1]

# Select columns of interest
selected_columns = df[['Name','Age','Sex','Survived']]

In [8]:
# Passengers who survived
survived_passenger.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [9]:
selected_columns.head()

Unnamed: 0,Name,Age,Sex,Survived
0,"Braund, Mr. Owen Harris",22.0,male,0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,female,1
2,"Heikkinen, Miss. Laina",26.0,female,1
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,female,1
4,"Allen, Mr. William Henry",35.0,male,0
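
Row filters and column selections can also be combined. The sketch below uses an invented toy frame with the same column names; it shows a boolean mask built from two conditions and the equivalent `DataFrame.query` call.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Survived": [0, 1, 1, 0],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
mask = (toy["Survived"] == 1) & (toy["Age"] > 30)

# .loc filters rows and selects columns in one step
older_survivors = toy.loc[mask, ["Name", "Age"]]

# The same row filter written as a query string
same_rows = toy.query("Survived == 1 and Age > 30")
```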


In [10]:
# Sorting arranges data in ascending or descending order

# Sort by age in descending order
sorted_by_age = df.sort_values(by='Age', ascending=False)

In [11]:
sorted_by_age.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0,A23,S
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.775,,S
493,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q
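
`sort_values` also accepts several sort keys and a `na_position` argument that controls where missing values end up. A minimal sketch on an invented toy frame:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Pclass": [3, 1, 1, 3],
    "Age": [22.0, np.nan, 54.0, 35.0],
})

# Sort by class ascending, then by age descending within each class
by_class_age = toy.sort_values(by=["Pclass", "Age"], ascending=[True, False])

# Put rows with a missing age first instead of last (the default)
nan_first = toy.sort_values(by="Age", ascending=False, na_position="first")
```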


In [12]:
# Aggregation combines multiple values into a single summary value
# Calculate the average age of the passengers
df['Age'].mean()

29.69911764705882

In [13]:
df['Age'].agg('max')

80.0
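
Several aggregations can be computed in one call by passing a list of function names to `agg`. The values below are invented for illustration.

```python
import pandas as pd

# Toy ages, invented for illustration only
ages = pd.Series([22.0, 38.0, 26.0, 35.0], name="Age")

# One call returns a Series indexed by the aggregation names
summary = ages.agg(["min", "max", "mean"])
```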

In [14]:
# Reshape data from wide to long format
melted_data = pd.melt(df, id_vars=['PassengerId', 'Survived'], var_name='Variable', value_name='Value') #Using pd.melt()

In [15]:
melted_data.head()

Unnamed: 0,PassengerId,Survived,Variable,Value
0,1,0,Pclass,3
1,2,1,Pclass,1
2,3,1,Pclass,3
3,4,1,Pclass,1
4,5,0,Pclass,3


In [16]:
# Create new features from existing ones
# Create a new feature 'FamilySize' by adding SibSp and Parch
df['FamilySize'] = df['SibSp'] + df['Parch']

In [17]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FamilySize
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0


#### Data Grouping and Summarization

In [18]:
# Group data by Pclass and calculate the mean age for each class
class_age_means = df.groupby('Pclass')['Age'].mean()

In [19]:
class_age_means

Pclass
1    38.233441
2    29.877630
3    25.140620
Name: Age, dtype: float64

In [20]:
# Calculate the total fare for each class
class_fare_totals = df.groupby('Pclass')['Fare'].sum()

In [21]:
class_fare_totals

Pclass
1    18177.4125
2     3801.8417
3     6714.6951
Name: Fare, dtype: float64
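
The two group summaries above can also be produced in a single pass with named aggregation. This is a sketch on invented toy data; the output column names `mean_age`, `total_fare`, and `passengers` are chosen here for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3],
    "Age": [38.0, 35.0, 22.0, 26.0],
    "Fare": [71.28, 53.10, 7.25, 7.92],
})

# Named aggregation: new_column=(source_column, function)
per_class = toy.groupby("Pclass").agg(
    mean_age=("Age", "mean"),
    total_fare=("Fare", "sum"),
    passengers=("Age", "size"),
)
```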

In [22]:
# Group data by 'Pclass', 'Sex', and 'Embarked' and calculate the mean age for each group
class_sex_embarked_age_means = df.groupby(['Pclass','Sex','Embarked'], as_index = False)['Age'].mean()

In [23]:
class_sex_embarked_age_means

Unnamed: 0,Pclass,Sex,Embarked,Age
0,1,female,C,36.052632
1,1,female,Q,33.0
2,1,female,S,32.704545
3,1,male,C,40.111111
4,1,male,Q,44.0
5,1,male,S,41.897188
6,2,female,C,19.142857
7,2,female,Q,30.0
8,2,female,S,29.719697
9,2,male,C,25.9375


In [24]:
# Pivot table of the mean age by class and sex
pivot_table = pd.pivot_table(df, values='Age', index='Pclass', columns='Sex', aggfunc='mean')

# Pivot table of the mean age by class and sex (rows) and age bin (columns)
pivot_table2 = pd.pivot_table(df, values='Age', index=['Pclass','Sex'], columns=pd.cut(df['Age'], [0,18,30,50,100]), aggfunc='mean')

In [25]:
pivot_table

Pclass,female,male
1,34.611765,41.281386
2,28.722973,30.740707
3,21.75,26.507589


In [26]:
pivot_table2

Pclass,Sex,"(0, 18]","(18, 30]","(30, 50]","(50, 100]"
1,female,15.181818,24.291667,39.486486,56.230769
1,male,10.184,25.52381,41.091837,60.346154
2,female,9.714286,25.6,39.092593,55.333333
2,male,8.288667,25.135135,37.085714,57.583333
3,female,10.209302,24.388889,38.113636,63.0
3,male,11.223922,23.995902,37.584507,59.777778
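
Both pivot tables aggregate `Age`, so their cells are mean ages. Because the mean of a 0/1 column is the share of ones, a survival-rate table can be sketched by aggregating `Survived` instead. The toy rows below are invented for illustration and do not reproduce the Titanic figures.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3, 1],
    "Sex": ["female", "male", "female", "male", "male", "female"],
    "Survived": [1, 0, 1, 0, 1, 1],
})

# Mean of the 0/1 'Survived' column = survival rate per (class, sex) cell
survival_rate = pd.pivot_table(
    toy, values="Survived", index="Pclass", columns="Sex", aggfunc="mean"
)
```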


In [27]:
# Split data based on Pclass
df1 = df[df["Pclass"] == 1]
df2 = df[df["Pclass"] == 2]

In [28]:
df1.shape, df2.shape

((216, 13), (184, 13))

In [29]:
# Combine DataFrames using concat
# Concatenate vertically (stack the rows)
combined_data = pd.concat([df1, df2])

# Concatenate horizontally (rows are aligned on the index, so non-matching indices produce NaN)
combined_data2 = pd.concat([df1, df2], axis=1)

In [30]:
combined_data.reset_index(drop=True).head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FamilySize
0,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1
1,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1
2,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,0
3,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S,0
4,24,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,113788,35.5,A6,S,0


In [31]:
combined_data2.reset_index(drop=True).head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,...,Name.1,Sex.1,Age.1,SibSp.1,Parch.1,Ticket.1,Fare.1,Cabin,Embarked,FamilySize
0,2.0,1.0,1.0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1.0,0.0,PC 17599,71.2833,...,,,,,,,,,,
1,4.0,1.0,1.0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1.0,0.0,113803,53.1,...,,,,,,,,,,
2,7.0,0.0,1.0,"McCarthy, Mr. Timothy J",male,54.0,0.0,0.0,17463,51.8625,...,,,,,,,,,,
3,12.0,1.0,1.0,"Bonnell, Miss. Elizabeth",female,58.0,0.0,0.0,113783,26.55,...,,,,,,,,,,
4,24.0,1.0,1.0,"Sloper, Mr. William Thompson",male,28.0,0.0,0.0,113788,35.5,...,,,,,,,,,,
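
The NaN values in this output come from index alignment: `pd.concat(..., axis=1)` matches rows by index label, and the two class subsets share no labels. A small sketch with invented frames shows the effect, and how resetting the index first places rows side by side by position instead.

```python
import pandas as pd

# Two toy frames with disjoint index labels, invented for illustration only
first = pd.DataFrame({"a": [1, 2]}, index=[0, 3])
second = pd.DataFrame({"b": [10, 20]}, index=[1, 2])

# Aligned on index labels: four rows, each with one NaN
aligned = pd.concat([first, second], axis=1)

# Reset both indexes first to pair rows by position instead
side_by_side = pd.concat(
    [first.reset_index(drop=True), second.reset_index(drop=True)], axis=1
)
```

Pairing by position only makes sense when the rows of both frames genuinely correspond to each other.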


In [32]:
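# Merge data on a key column
# df1 and df2 share no PassengerId values, so the left join leaves every *_B column empty
merged_data = pd.merge(df1, df2,
                      on='PassengerId', how='left',
                      suffixes=('_A','_B')) # 'how' can be 'left', 'right', 'inner', or 'outer'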

In [33]:
merged_data.head()

Unnamed: 0,PassengerId,Survived_A,Pclass_A,Name_A,Sex_A,Age_A,SibSp_A,Parch_A,Ticket_A,Fare_A,...,Name_B,Sex_B,Age_B,SibSp_B,Parch_B,Ticket_B,Fare_B,Cabin_B,Embarked_B,FamilySize_B
0,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,...,,,,,,,,,,
1,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,...,,,,,,,,,,
2,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,...,,,,,,,,,,
3,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,...,,,,,,,,,,
4,24,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,113788,35.5,...,,,,,,,,,,
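
The effect of `how` is easier to see when the key sets overlap only partly. The sketch below uses two invented frames and the `indicator=True` option of `pd.merge`, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Toy frames with partly overlapping keys, invented for illustration only
left = pd.DataFrame({"PassengerId": [1, 2, 3], "Age": [22.0, 38.0, 26.0]})
right = pd.DataFrame({"PassengerId": [2, 3, 4], "Fare": [71.28, 7.92, 53.10]})

# Left join: keep every row of 'left'; unmatched rows get NaN in 'Fare'
left_join = pd.merge(left, right, on="PassengerId", how="left", indicator=True)

# Inner join: keep only the keys present in both frames
inner_join = pd.merge(left, right, on="PassengerId", how="inner")
```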


#### Data Storing
After wrangling, you can store the data, for example with [`DataFrame.to_csv`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html).

In [34]:
# Prepare data
combined_data = combined_data.reset_index(drop=True)

In [35]:
# To CSV (the 'result' directory must already exist)
combined_data.to_csv("result/titanic_cleaned.csv")

In [36]:
# To JSON
combined_data.to_json("result/titanic_cleaned.json")
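
Writing to a directory that does not exist raises an error, and by default `to_csv` also writes the index as an extra column. A minimal sketch of one way to handle both, using a temporary directory and an invented toy frame so that it runs anywhere; the file name `toy.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    # Create the output directory if it is missing
    out_dir = Path(tmp) / "result"
    out_dir.mkdir(parents=True, exist_ok=True)

    # index=False leaves the row index out of the file
    csv_path = out_dir / "toy.csv"
    toy.to_csv(csv_path, index=False)

    # Reading the file back gives the same columns, with no extra index column
    round_trip = pd.read_csv(csv_path)
```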

### REFERENCE
1. https://www.geeksforgeeks.org/data-wrangling-in-python/
2. https://pythongeeks.org/data-wrangling-in-python-with-examples/
3. https://www.tutorialspoint.com/python_data_science/python_data_wrangling.htm
4. https://dlab.berkeley.edu/events/python-data-wrangling-and-manipulation-pandas/2023-05-04