# Welcome to the Data Manipulation Exercises

The workbook has been broken up into three sections.  Each section corresponds to a reading assignment in the textbook.

In [136]:
import pandas as pd
import numpy as np

data= pd.read_csv("titanic.csv")

## Before You Get Started

We are going to be using the Titanic Dataset. Make sure to run a head() before you start working with manipulation methods.

In [137]:
# Run the head of your data set here:
data.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [138]:
# check for missing values (isnull() flags NaN cells; duplicated() is the check for duplicate rows)
data.isnull()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
887,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,True,False,False,False,False,False,False,False,True,False,False,False
889,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [139]:
# if there are missing values, go ahead and drop the rows that contain them:
data = data.dropna()
print(data)


     survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
1           1       1  female  38.0      1      0  71.2833        C  First   
3           1       1  female  35.0      1      0  53.1000        S  First   
6           0       1    male  54.0      0      0  51.8625        S  First   
10          1       3  female   4.0      1      1  16.7000        S  Third   
11          1       1  female  58.0      0      0  26.5500        S  First   
..        ...     ...     ...   ...    ...    ...      ...      ...    ...   
871         1       1  female  47.0      1      1  52.5542        S  First   
872         0       1    male  33.0      0      0   5.0000        S  First   
879         1       1  female  56.0      0      1  83.1583        C  First   
887         1       1  female  19.0      0      0  30.0000        S  First   
889         1       1    male  26.0      0      0  30.0000        C  First   

       who  adult_male deck  embark_town alive  alone  
1    wo
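The cells above use `isnull()` and `dropna()`, which deal with missing values rather than duplicate rows. As a minimal sketch, assuming a small made-up table (the `toy` frame and its values are hypothetical, not part of the Titanic data), the two kinds of checks can be compared side by side:

```python
import numpy as np
import pandas as pd

# Toy frame (made up for illustration): row 2 repeats row 0, row 1 has a missing age.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann"],
    "age": [30.0, np.nan, 30.0],
})

missing_per_column = toy.isnull().sum()   # counts NaN cells in each column
duplicate_rows = toy.duplicated().sum()   # counts rows that repeat an earlier row

no_missing = toy.dropna()                 # drops rows containing any NaN
no_duplicates = toy.drop_duplicates()     # drops repeated rows, keeps the first copy

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```

Note that `dropna()` removes a row if any column is missing, which is why the Titanic table shrinks so much once the sparse "deck" column is taken into account.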

### Cleaning Note:

While the columns are not the "prettiest", don't adjust any of them yet. We are going to update some values and add some values as we work through this notebook. Apologies for the extra visual "noise" on your screen. You will be given the option to tidy up the columns at the end of this notebook.

## Running Tables Note:  
If your tables don't appear to have accepted your changes, try the "Run All" option in the "Cell" section of the menu bar.  

<span style="background-color:dodgerblue; color:dodgerblue;">- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -</span> 

# A. Aggregation

1. Work through the section Exercises.  
    - There are 4 sections in part A:
        - Groupby
        - Aggregation Methods
        - Groupby and Basic Math
        - Groupby and Multiple Aggregations


#### Creating Variables.

As we begin to manipulate our data, create new variables to store your work in.  This will keep your original data intact.  Having the original dataset available will save you time with each manipulation.  You can also create variable names that inform you of the purpose of the manipulation.

### 1: Groupby <span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span> 

#### Groupby "embark_town"

1. Using the titanic data set, groupby "embark_town".
1. Create a variable that will represent the grouping of data.
1. Initialize it using the groupby() function and pass it the column.

In [140]:
# Code your groupby "embark_town" here:
grouping_embark = data.groupby('embark_town')

In [141]:
# To view the grouped data as a table, use the variable_name.first():
print(grouping_embark.first())

             survived  pclass     sex   age  sibsp  parch     fare embarked  \
embark_town                                                                   
Cherbourg           1       1  female  38.0      1      0  71.2833        C   
Queenstown          0       1    male  44.0      2      0  90.0000        Q   
Southampton         1       1  female  35.0      1      0  53.1000        S   

             class    who  adult_male deck alive  alone  
embark_town                                              
Cherbourg    First  woman       False    C   yes  False  
Queenstown   First    man        True    C    no  False  
Southampton  First  woman       False    C   yes  False  


#### Groupby "survived"

Did you know that you can also chain on some of our exploratory methods to the groupby method?

1. Create & initialize a new variable to hold a table that will groupby "survived" 
1. Use method chaining to tack on the describe method

In [142]:
# Code your groupby "survived" table here:
grouping_survived = data.groupby('survived')

# run your table below:
print(grouping_survived.first())


          pclass     sex   age  sibsp  parch     fare embarked  class    who  \
survived                                                                       
0              1    male  54.0      0      0  51.8625        S  First    man   
1              1  female  38.0      1      0  71.2833        C  First  woman   

          adult_male deck  embark_town alive  alone  
survived                                             
0               True    E  Southampton    no   True  
1              False    C    Cherbourg   yes  False  


In [143]:
# run your table with describe
grouping_survived.describe()

Unnamed: 0_level_0,pclass,pclass,pclass,pclass,pclass,pclass,pclass,pclass,age,age,...,parch,parch,fare,fare,fare,fare,fare,fare,fare,fare
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
survived,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
0,59.0,1.220339,0.589207,1.0,1.0,1.0,1.0,3.0,59.0,41.288136,...,1.0,4.0,59.0,64.532131,62.10809,0.0,27.15,50.4958,79.2,263.0
1,123.0,1.178862,0.479626,1.0,1.0,1.0,1.0,3.0,123.0,32.905854,...,1.0,2.0,123.0,85.821107,81.843522,8.05,30.5,69.3,93.5,512.3292
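The two header rows in the describe output can be hard to read. The sketch below, built on a made-up frame (`toy` and its values are hypothetical), suggests how the (column, statistic) pairs are laid out and how one piece of the table might be pulled out:

```python
import pandas as pd

# Toy frame (hypothetical values) with one grouping column and two numeric columns.
toy = pd.DataFrame({
    "survived": [0, 0, 1, 1],
    "age": [40.0, 50.0, 20.0, 30.0],
    "fare": [10.0, 20.0, 30.0, 50.0],
})

summary = toy.groupby("survived").describe()

# Columns are (column name, statistic) pairs: 2 numeric columns x 8 statistics.
print(summary.shape)

# Selecting the outer level returns the 8 statistics for that one column.
age_stats = summary["age"]

# A tuple selects a single statistic for a single column.
mean_age = summary[("age", "mean")]
print(mean_age)
```

The same counting applies to the Titanic table: 5 numeric columns times 8 statistics gives 40 columns.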


In [144]:
# How is this table organized?  Why are there 40 columns now?
# The rows are indexed by "survived" (0 or 1), and the columns form a two-level hierarchy:
# the outer level is the original column name and the inner level is the statistic.
# There are 40 columns because the .describe() method calculated 8 different statistics
# (count, mean, std, min, 25%, 50%, 75%, max) for each of the 5 numerical columns in the
# dataset (pclass, age, sibsp, parch, fare).

### 2. Aggregation Methods <span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span> 

Note: **agg()** and **aggregate()** are identical [source](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.aggregate.html)
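As a quick illustration of the note above, this sketch (using a made-up `toy` frame, not the Titanic data) compares `agg()`, `aggregate()`, and the dedicated `mean()` shortcut on grouped data:

```python
import pandas as pd

# Toy frame (hypothetical values); every column is numeric on purpose.
toy = pd.DataFrame({
    "survived": [0, 0, 1, 1],
    "age": [40.0, 50.0, 20.0, 30.0],
    "fare": [10.0, 20.0, 30.0, 50.0],
})
grouped = toy.groupby("survived")

with_agg = grouped.agg("mean")              # statistic passed as a string
with_aggregate = grouped.aggregate("mean")  # same method under its longer name
shortcut = grouped.mean()                   # dedicated method for the same statistic

print(with_agg)
```

If the grouped table still contains text columns, some pandas versions may raise an error for statistics such as "mean" instead of skipping those columns, so selecting the numeric columns first is the safer route.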

#### Method Chaining

1. Create a variable to apply **agg()** to your grouped data.
1. Pass one of the following statistical values to **agg()**
   - "mean", "median", "mode", "min", "max", "std", "var", "first", "last", "sum"

In [145]:
# Code your method chain here:
  
survived_means = grouping_survived.agg("mean")   

In [146]:
# Create a variable to apply agg("sum") to your grouped data
survived_sum = grouping_survived.agg("sum")
# run your table:
# print(survived_means)
print(survived_sum)

          pclass      age  sibsp  parch        fare  adult_male  alone
survived                                                              
0             72  2436.00     22     27   3807.3957          53     30
1            145  4047.42     63     60  10555.9961          34     48


In [147]:
# Explain the sum table.  What is going on with the "sex", "class", and "alive" columns?
#
# The sum table totals each numerical column (pclass, age, sibsp, parch, fare) by survival
# status (0=Died, 1=Survived). The boolean columns "adult_male" and "alone" are summed too,
# because True counts as 1 and False as 0, so those totals are counts of True values.
#
# The "sex", "class", and "alive" columns are missing because they contain text (non-numerical)
# data, which cannot be arithmetically summed, so they are left out of the table shown above.

#### Using a Dictionary <span style="color:darkorange;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span> 
##### A dictionary is a Python collection type.  

A dictionary is a collection type that stores **key-value pairs**.  A key-value pair is an organization system that is made up of a single *key* that has one or more *values* paired with it.  
Think of it like your contacts list.  The contacts list is the dictionary object.  
Each contact is organized by a key, usually name.  And attached to each name is contact information, or the values.
Some contacts might have email address, phone number, home or work address, etc. Other contacts may just be a name and phone number.  This is a very simple example, but understanding this organizational structure will be helpful as you learn to manipulate tables.  

*Here is a dictionary example with 3 keys:*
>**contacts_dictionary = {"name1": ["email", "555-5552", "work info"], 
      "name2": ["email", "555-5554"],
      "name3": "555-5555"}**
                     
*Here is a dictionary example with a single key-value pair*
**study_group_dictionary = {"john": ["john@email.com", "555-555-5555", "works at LaunchCode headquarters"]}**   

It has a single key, and a list of values. The organization of this structure is called a "Key-Value Pair".
Using the contact list example, the key would be the name of the person and the values would be their contact information.  The key is a single item (the person's name) and the values can be a single item (an email address) or multiple items (email, phone number, address, work info, etc).
Values can be any data type, but must use correct data type syntax (for example, a phone number needs quotes so that it is stored as text).  Keys do not have to be strings, but each key must be a single, unchanging (hashable) value such as a string, a number, or a tuple.  

For more information, you can read more on dictionary objects [here](https://www.w3schools.com/python/python_dictionaries.asp).
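To make the contacts analogy concrete, here is a small runnable sketch in plain Python. All names, emails, and numbers are made up for illustration:

```python
# A contacts list as a dictionary: each key (a name) maps to that person's details.
contacts = {
    "john": ["john@email.com", "555-555-5555", "works at LaunchCode headquarters"],
    "maria": ["maria@email.com", "555-5554"],
    "sam": "555-5555",
}

# Look up a value by its key, then index into the list stored there.
john_email = contacts["john"][0]

# Assigning to an existing key replaces its value.
contacts["sam"] = ["sam@email.com", "555-5555"]

# Keys keep the order in which they were added.
names = list(contacts.keys())

# Without quotes, Python treats 555-5555 as subtraction rather than a phone number.
not_a_phone_number = 555 - 5555

print(john_email, names, not_a_phone_number)
```

Phone numbers are written as strings here for the reason shown on the last lines: left unquoted, they would be evaluated as arithmetic.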


#### Aggregation across multiple columns using dictionary functionality

##### Syntax Example:

**age_dictionary={"age":["sum", "max"]}**

We are creating a new dictionary (**age_dictionary**).  The key is **"age"** and the values we want are **"sum"** and **"max"**.  This dictionary object has now become a template for the aggregations we want to perform.  However, on its own, it does nothing.  Once passed to the **agg()** method, it tells pandas which column to examine and which functions to apply, producing a small summary table.

The code is contained in the box below.  Run it and see what happens.


For syntax examples, review [this webpage](https://www.geeksforgeeks.org/python-pandas-dataframe-aggregate/).
#### <span style="color:coral;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span>

In [148]:
# Predict the table output before you run the code below.

age_dictionary={"age":["sum", "max"]}
dictionary_agg=data.agg(age_dictionary)
dictionary_agg

Unnamed: 0,age
sum,6483.42
max,80.0


1. What if we want to look at more than one column at a time?  We add more key-value pairs to the dictionary we pass to the agg function.
1. Create a variable to hold at least 3 columns.  Use the syntax from the "Syntax Example" as a guide.
    - Aggregate the following:  survived: "sum" & "count"; age: "std" & "min", and sibsp: "count" & "sum"

In [149]:
# Code your dictionary here:
example_dictionary = {
    'survived' : ["sum","count"],
    'age' : [ "std","min"],
    'sibsp' : ["count","sum"]
}
dictionary_aggregate = data.agg(example_dictionary)
print(dictionary_aggregate)


       survived        age  sibsp
sum       123.0        NaN   85.0
count     182.0        NaN  182.0
std         NaN  15.671615    NaN
min         NaN   0.920000    NaN
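The NaN cells in the output above are expected rather than a sign of missing data. A minimal sketch on a made-up frame (`toy` is hypothetical) suggests why they appear:

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "sibsp": [0, 1, 2],
})

# Each key names a column; each list names the functions to run on that column only.
result = toy.agg({"age": ["min", "max"], "sibsp": ["sum"]})
print(result)

# "sum" was never requested for age, so that cell is NaN (likewise min/max for sibsp).
age_sum_missing = pd.isna(result.loc["sum", "age"])
sibsp_total = result.loc["sum", "sibsp"]
```

The result has one row per function named anywhere in the dictionary, and a cell is filled only where that function was requested for that column.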


### 3. Groupby and Basic Math <span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span> 

1. Groupby "pclass".  Make sure you use a variable to hold your grouped data.

In [150]:
# <WARNING> For this section, you will need to filter out non-numeric data to run the aggregations
# The line below creates a new dataframe with only numeric columns to avoid warnings
num_only_data = data.select_dtypes(include='number')

# Code your groupby here using num_only_data:
passenger_class = num_only_data.groupby('pclass')

# Run your table using first() here instead of head():
print(passenger_class.first())

        survived   age  sibsp  parch     fare
pclass                                       
1              1  38.0      1      0  71.2833
2              1  34.0      0      0  13.0000
3              1   4.0      1      1  16.7000
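Once the data is grouped, aggregated results behave like ordinary tables, so regular arithmetic can be applied to them. The sketch below uses a made-up frame (`toy`, `survival_pct`, and `fare_range` are hypothetical names) to show the numeric filter and two simple calculations:

```python
import pandas as pd

toy = pd.DataFrame({
    "pclass": [1, 1, 2, 2],
    "survived": [1, 0, 1, 1],
    "fare": [80.0, 60.0, 20.0, 10.0],
    "sex": ["female", "male", "female", "male"],   # text column
    "alone": [True, False, True, False],           # boolean column
})

# include="number" keeps int and float columns; text and boolean columns are left out.
numeric_only = toy.select_dtypes(include="number")
by_class = numeric_only.groupby("pclass")

# Aggregated results are ordinary Series, so basic math works on them directly.
survival_pct = by_class["survived"].mean() * 100
fare_range = by_class["fare"].max() - by_class["fare"].min()

print(survival_pct)
print(fare_range)
```

This also explains why "adult_male" and "alone" do not appear in the grouped table above: boolean columns are not treated as numbers by `select_dtypes(include='number')`.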


### 4. Groupby and Multiple Aggregations <span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span> 

#### Group with a List<span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span>

1. We want to apply multiple aggregation functions to our newly grouped data set.  We create a variable to hold a list of the functions we want to perform.  Each name in the list refers to an aggregation that the agg method knows how to run.  When we pass our list to the method, the method will iterate through each item and perform that function for every column in the table.

In [151]:
# our list of functions
agg_func_list = ['sum', 'mean', 'median', 'min', 'max', 'std', 'var', 'first', 'last', 'count']


#Apply the agg method to our passenger_class variable (made in the Groupby Basic Math section).  
# Pass our list to the function and run your table.
grouping_function = passenger_class.agg(agg_func_list) 
print(grouping_function)

  

       survived                                                                \
            sum      mean median min max       std       var first last count   
pclass                                                                          
1           106  0.675159    1.0   0   1  0.469814  0.220725     1    1   157   
2            12  0.800000    1.0   0   1  0.414039  0.171429     1    0    15   
3             5  0.500000    0.5   0   1  0.527046  0.277778     1    1    10   

        ...        fare                                                  \
        ...         sum       mean   median    min       max        std   
pclass  ...                                                               
1       ...  13976.4501  89.021975  71.2833   0.00  512.3292  77.644586   
2       ...    276.6667  18.444447  13.0000  10.50   39.0000  10.141895   
3       ...    110.2750  11.027500  10.4625   7.65   16.7000   3.531942   

                                            
                v
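Passing a list to agg produces a wide table with two header levels, which is why the printout wraps. As a rough sketch on made-up data (`toy`, `wide`, and `flat` are hypothetical names), individual pieces can be selected like this:

```python
import pandas as pd

toy = pd.DataFrame({
    "pclass": [1, 1, 2, 2],
    "age": [30.0, 50.0, 20.0, 40.0],
    "fare": [80.0, 60.0, 20.0, 10.0],
})

wide = toy.groupby("pclass").agg(["mean", "max"])

# One top-level column name returns every statistic computed for that column.
fare_stats = wide["fare"]

# A (column, statistic) tuple returns a single Series.
mean_fare = wide[("fare", "mean")]

# Optional: flatten the two header levels into names such as "fare_mean".
flat = wide.copy()
flat.columns = ["_".join(pair) for pair in flat.columns]
print(flat)
```

Flattening is only a convenience for display or export; the two-level form carries the same numbers.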

#### Group with a Dictionary<span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span>

Using only a list provides us with the entire table.  What if we only want to look at age vs pclass?  

We can create a dictionary to hold the age column for us.  The *key* would be the name of our column, and the *values* would be our list of functions to perform on that column.  The code would look like this:

In [152]:
agg_func_dict = {
    'age':
    ['sum', 'mean', 'median', 'min', 'max', 'std', 'var', 'first', 'last', 'count']
}
# We would run our table like this:
passenger_class.agg(agg_func_dict)  

Unnamed: 0_level_0,age,age,age,age,age,age,age,age,age,age
Unnamed: 0_level_1,sum,mean,median,min,max,std,var,first,last,count
pclass,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
1,5894.42,37.544076,36.0,0.92,80.0,14.955177,223.657317,38.0,26.0,157
2,379.0,25.266667,29.0,1.0,57.0,16.191782,262.17381,34.0,57.0,15
3,210.0,21.0,24.5,2.0,42.0,13.190906,174.0,4.0,27.0,10


Looking at the *agg_func_dict* syntax, create a dictionary variable for the "survived" column and pass it to **passenger_class.agg()** in the box below.

In [153]:
# Code it here:
agg_survived_dict = {
    
    'survived':
    ['sum', 'mean', 'median', 'min', 'max', 'std', 'var', 'first', 'last', 'count']
}

passenger_class.agg(agg_survived_dict) 


Unnamed: 0_level_0,survived,survived,survived,survived,survived,survived,survived,survived,survived,survived
Unnamed: 0_level_1,sum,mean,median,min,max,std,var,first,last,count
pclass,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
1,106,0.675159,1.0,0,1,0.469814,0.220725,1,1,157
2,12,0.8,1.0,0,1,0.414039,0.171429,1,0,15
3,5,0.5,0.5,0,1,0.527046,0.277778,1,1,10


<span style="background-color:dodgerblue; color:dodgerblue;">- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -</span> 

# B. Recoding and Creating New Values and Variables 

1. Work through Part B.  There are 3 sections.

### Create a New Column <span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span>

In the box below:
1. Create a new column by manipulating the values of an existing column.  Specifically, create a new column, "fare_2021" that allows us to compare the cost of fare in pounds back in 1912 to 2021.  The inflation multiplier from 1912 to 2021 is approximately 117.17.

In [154]:
# Code your new "fare_2021" column here:
inflation_multiplier = 117.17
data["fare_2021"] = data["fare"]*inflation_multiplier
# Run the head of your table to see your new column:
data.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,fare_2021
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,8352.264261
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,6221.727
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True,6076.729125
10,1,3,female,4.0,1,1,16.7,S,Third,child,False,G,Southampton,yes,False,1956.739
11,1,1,female,58.0,0,0,26.55,S,First,woman,False,C,Southampton,yes,True,3110.8635
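The new column is plain column arithmetic: multiplying a column by a single number applies the multiplication to every row. The sketch below rechecks the first two rows of the output above on a tiny frame; the rounded column is an optional extra that is not part of the exercise:

```python
import pandas as pd

inflation_multiplier = 117.17  # value given in the exercise text

# Two fares copied from the head() output above.
toy = pd.DataFrame({"fare": [71.2833, 53.1]})

# Multiplying a column by a scalar applies the operation to every row at once.
toy["fare_2021"] = toy["fare"] * inflation_multiplier

# Optional: round to two decimals, since currency rarely needs six.
toy["fare_2021_rounded"] = toy["fare_2021"].round(2)
print(toy)
```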


### Replacing Values <span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span> 
 
Replace the string-based "yes","no" values in the "alive" column using booleans, replacing "yes" values with True and "no" values with False.

In [155]:
# Code your updated values here:
data["alive"] = data["alive"].replace(to_replace={"yes": True, "no": False})




We can also use functions to update values.

1. Create a function that will convert the string-based "alive" values of "yes" or "no" to a boolean value of True or False. Apply it to your table and run your table here:

In [156]:
# Code your function here:
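def convert_string_based(value):
    # "yes" -> True and "no" -> False; anything else (including values that are
    # already booleans after the previous cell) is returned unchanged.
    if value == 'yes':
        return True
    elif value == 'no':
        return False
    else:
        return value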
    
    
data['alive'] = data['alive'].map(convert_string_based)
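The two approaches in this section behave slightly differently when a value is not "yes" or "no". A small sketch on a made-up Series (`alive`, `to_bool`, and the "unknown" entry are hypothetical) illustrates the difference:

```python
import pandas as pd

alive = pd.Series(["yes", "no", "yes", "unknown"])
mapping = {"yes": True, "no": False}

def to_bool(value):
    """Convert "yes"/"no" to booleans and leave anything else untouched."""
    if value == "yes":
        return True
    if value == "no":
        return False
    return value

# map() with a function: the function decides what happens to unexpected values.
via_function = alive.map(to_bool)

# replace() with a dictionary: values not listed are kept as they are.
via_replace = alive.replace(mapping)

# map() with a dictionary: values not listed become NaN.
via_dict = alive.map(mapping)

print(via_function.tolist())
print(via_dict.tolist())
```

When every value is covered by the dictionary, as with the "alive" column, all three give the same result.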


### Using a function to create a new column <span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span> 

Sometimes you might want to create a new column by applying a function to the values of an existing column, or by combining multiple columns together.

1. Create an "age_group" column that breaks years up as 0-19, 20-29, 30-39, etc. until all given ages are covered.  Make sure you check to see where you can stop counting by 10s.

In [157]:
# Write your max age check here:
max_age = data["age"].max()
print(max_age)


In [158]:
# Code the new "age_group" column function here:
def get_age_group(age):
    if pd.isna(age):
        return "Unknown"
    elif age < 20:
        return "0-19"
    elif age < 30:
        return "20-29"
    elif age < 40:
        return "30-39"
    elif age < 50:
        return "40-49"
    elif age < 60:
        return "50-59"
    elif age < 70:
        return "60-69"
    elif age < 80:
        return "70-79"
    else:
        return "80+"


data["age_group"] = data["age"].map(get_age_group)
print(data["age_group"])


1      30-39
3      30-39
6      50-59
10      0-19
11     50-59
       ...  
871    40-49
872    30-39
879    50-59
887     0-19
889    20-29
Name: age_group, Length: 182, dtype: object
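As an alternative to a hand-written function, pandas offers `pd.cut` for binning numbers into labelled ranges. This is a different technique from the one the exercise asks for; the sketch below assumes the same group boundaries and uses made-up ages:

```python
import numpy as np
import pandas as pd

ages = pd.Series([0.92, 19.0, 20.0, 35.0, 79.9, 80.0, np.nan])

# Bin edges follow the notebook's groups; right=False makes each bin [low, high).
edges = [0, 20, 30, 40, 50, 60, 70, 80, np.inf]
labels = ["0-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79", "80+"]

age_group = pd.cut(ages, bins=edges, labels=labels, right=False)

# pd.cut leaves missing ages as NaN; add an explicit "Unknown" label if wanted.
age_group = age_group.cat.add_categories("Unknown").fillna("Unknown")
print(age_group.tolist())
```

The result is a categorical column, which keeps the groups in their natural order when sorting or pivoting.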


<span style="background-color:dodgerblue; color:dodgerblue;">- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -</span> 

# C. Reshaping Tables

1. Work through Part C.  There are 4 sections.

### Sort_values <span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span> 

Use **sort_values()** to answer the following question:
> What is the age of the person who paid the highest fare?

Hint: We want to see the highest fare value first. To find this, should we set the sort order to ascending or descending?  Check the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html?highlight=sort_values#pandas.DataFrame.sort_values) for the syntax.

In [159]:
# Code your sort_values here:
highest_fare = data.sort_values(by="fare", ascending=False)

# Run your table here:
print(highest_fare)

     survived  pclass     sex   age  sibsp  parch      fare embarked  class  \
737         1       1    male  35.0      0      0  512.3292        C  First   
679         1       1    male  36.0      0      1  512.3292        C  First   
341         1       1  female  24.0      3      2  263.0000        S  First   
88          1       1  female  23.0      3      2  263.0000        S  First   
438         0       1    male  64.0      1      4  263.0000        S  First   
..        ...     ...     ...   ...    ...    ...       ...      ...    ...   
715         0       3    male  19.0      0      0    7.6500        S  Third   
75          0       3    male  25.0      0      0    7.6500        S  Third   
872         0       1    male  33.0      0      0    5.0000        S  First   
806         0       1    male  39.0      0      0    0.0000        S  First   
263         0       1    male  40.0      0      0    0.0000        S  First   

       who  adult_male deck  embark_town  alive  al
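In the sorted output above, two passengers share the highest fare (512.3292), with ages 35 and 36. The sketch below, on a made-up frame that mimics that tie, shows one way to read the answer from a sorted table and one way to catch ties:

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [35.0, 24.0, 36.0, 40.0],
    "fare": [512.3292, 263.0, 512.3292, 0.0],
})

# Descending order puts the highest fare in the first row.
by_fare = toy.sort_values(by="fare", ascending=False)
top_row = by_fare.iloc[0]

# Ties are possible, so filtering on the maximum returns every matching row.
top_payers = toy[toy["fare"] == toy["fare"].max()]
top_ages = sorted(top_payers["age"].tolist())

print(top_row["fare"], top_ages)
```

Reading only the first row would report a single age, so checking for ties gives a more complete answer to the question.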

### pivot_table <span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span> 
1. Pivot the table of the summed data where the values are "fare", the index is "who" and "age_group", and the columns are "survived".

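Hint: set the aggfunc parameter to "sum".  Passing np.sum gives the same totals, but recent pandas versions may warn that the string form is preferred.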




In [160]:
# Code your pivot_table here:
pivot_table = data.pivot_table(values='fare', aggfunc='sum', index=['who','age_group'], columns='survived')

# Run your table here:
print(pivot_table)


survived                0          1
who   age_group                     
child 0-19       162.0125   843.9208
man   0-19       432.6500   110.8833
      20-29      719.7583   431.7584
      30-39      529.3583  1474.7251
      40-49      637.5416   411.1543
      50-59      575.2958    92.5500
      60-69      443.9000    79.2000
      70-79      105.6542        NaN
      80+             NaN    30.0000
woman 0-19            NaN  1066.6917
      20-29      162.0125  1710.2750
      30-39           NaN  2174.7041
      40-49           NaN   940.2418
      50-59       39.2125  1036.6833
      60-69           NaN   153.2083
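A pivot table is closely related to groupby. As a sketch on made-up data (the `toy` frame is hypothetical), the same totals can be produced by grouping on all three columns and then moving "survived" into the column headers:

```python
import pandas as pd

toy = pd.DataFrame({
    "who": ["man", "man", "woman", "woman", "woman"],
    "age_group": ["20-29", "20-29", "20-29", "30-39", "30-39"],
    "survived": [0, 1, 1, 1, 1],
    "fare": [10.0, 20.0, 30.0, 40.0, 50.0],
})

pivoted = toy.pivot_table(values="fare", index=["who", "age_group"],
                          columns="survived", aggfunc="sum")

# The same numbers via groupby: sum per combination, then move "survived" into columns.
grouped = (
    toy.groupby(["who", "age_group", "survived"])["fare"]
    .sum()
    .unstack("survived")
)

print(pivoted)
```

A NaN in the pivot table means no rows had that combination, for example no women aged 30-39 with survived equal to 0 in this toy data.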


### Wide to Long <span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span> 

1. Create a table where the columns are "who" and the values are "pclass"
1. Answer the question:  How does this table differ from the pivot_table above?  Specifically, how is "who" different?

In [161]:
# Code your table here:
who_pclass_tables = data.pivot_table(columns='who', values='pclass')

# Run your table here:
print(who_pclass_tables)

# Answer the question here:
# In short, in the first table, "who" is part of the row index and acts as a category label that groups rows of data together.
# In the second table, "who" defines the structure of the columns themselves: each category (child, man, woman) becomes its own column.

### Melt <span style="color:dodgerblue;"> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - </span> 

1.  What does the **melt** function do to the data? 

In [162]:
# What does melt do?
# The melt function reshapes a table from wide format to long format: the melted columns are
# stacked into a "variable" column (the old column name) and a "value" column (the cell contents).

2.  With a new variable, apply a default melt to your data.

In [163]:
# Create your default melt table here with the following syntax: 
new_name = pd.melt(data)
# Run your table here:
print(new_name)
# Check the shape of your new table.
new_name.shape

       variable  value
0      survived      1
1      survived      1
2      survived      0
3      survived      1
4      survived      1
...         ...    ...
3089  age_group  40-49
3090  age_group  30-39
3091  age_group  50-59
3092  age_group   0-19
3093  age_group  20-29

[3094 rows x 2 columns]


(3094, 2)
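The shape above follows from simple counting: a default melt stacks every column, so the 182 rows and 17 columns of the table at this point give 182 x 17 = 3094 rows. A minimal sketch on a made-up frame (`toy` is hypothetical) shows the same pattern at a small scale:

```python
import pandas as pd

toy = pd.DataFrame({
    "survived": [1, 0],
    "fare": [71.28, 8.05],
    "deck": ["C", "E"],
})

long_form = pd.melt(toy)

# A default melt stacks every column: rows_out = rows_in * number_of_columns.
expected_rows = toy.shape[0] * toy.shape[1]

print(long_form)
print(long_form.shape)
```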

3. Create a melt table where the index variables are "embarked", and the values are "fare" and "deck"

In [164]:
# Create your melt table here:
melt_table_embarked= data.melt(id_vars= ['embarked'],value_vars= ['fare' , 'deck'])
# Run your table here:
print(melt_table_embarked)
# Check the shape


    embarked variable    value
0          C     fare  71.2833
1          S     fare     53.1
2          S     fare  51.8625
3          S     fare     16.7
4          S     fare    26.55
..       ...      ...      ...
359        S     deck        D
360        S     deck        B
361        C     deck        C
362        S     deck        B
363        C     deck        C

[364 rows x 3 columns]
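With `id_vars` and `value_vars`, only the listed value columns are stacked, and the identifier column is repeated next to each value. That is why the table above has 182 x 2 = 364 rows. A small sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "embarked": ["C", "S"],
    "fare": [71.28, 53.10],
    "deck": ["C", "E"],
    "age": [38.0, 35.0],   # not listed in value_vars, so it is left out
})

melted = toy.melt(id_vars=["embarked"], value_vars=["fare", "deck"])

# One output row per (input row, value column) pair; id_vars repeat alongside.
print(melted)
print(melted.shape)
```

Because "fare" holds numbers and "deck" holds text, the combined "value" column mixes both types, which is worth keeping in mind before doing math on it.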


# Optional Challenges:

1. Clean and Explore the table.  
    - A. How would you handle any missing data?
        - After dropna(), the remaining subset has no missing values (age is present in every row).
        - For any remaining NaN values in age in the full dataset, I would impute using the median age of the entire group.
    - B. Would you keep all of the columns?
        - No. I would drop the following redundant or low-value columns:
            - deck: Most of the full dataset is likely missing this data (though it is present in your subset).
            - class: Redundant information to the numerical pclass column.
            - embarked: Redundant with embark_town, which names the same port in full.
    - C. Would you want to manipulate any data?
        - Yes, for better analysis and modeling:
            - Create Family_Size: Combine sibsp + parch to analyze family influence.
            - Convert alive to integer: Change True/False to 1/0 for easier mathematical operations and statistical modeling.
            - Drop unnecessary columns: Streamline the table by dropping the columns mentioned above.
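
The ideas in these answers could be sketched as follows. This is one possible approach on a made-up frame (`toy`, `cleaned`, and `family_size` are hypothetical names), not a required solution:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [22.0, np.nan, 38.0],
    "sibsp": [1, 0, 1],
    "parch": [0, 0, 2],
    "alive": [False, True, True],
    "class": ["Third", "First", "First"],
    "pclass": [3, 1, 1],
})

# Work on a copy so the original table stays intact.
cleaned = toy.copy()

# Impute missing ages with the median age.
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())

# Combine two columns into a new one.
cleaned["family_size"] = cleaned["sibsp"] + cleaned["parch"]

# Convert True/False to 1/0.
cleaned["alive"] = cleaned["alive"].astype(int)

# Drop a column that repeats information held elsewhere (class vs. pclass).
cleaned = cleaned.drop(columns=["class"])

print(cleaned)
```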