In [26]:
import pandas as pd

!wget -nc https://raw.githubusercontent.com/AkeemSemper/ML_for_Non_DS_Students/main/data/titanic_train.csv
FILE_PATH = 'titanic_train.csv'

zsh:1: command not found: wget


# Practice and Review Problems

Let's work through some practice exercises. Some of these will be simple, some a little more complex. For each problem, remember to tackle it systematically:
<ul>
<li> What is the problem asking for? (I.e. what is the end result you are doing/producing?) </li>
<li> What are the inputs? (I.e. what do you have to get started?) </li>
<li> These two steps basically define the start and end of a function that does what you need. </li>
<li> What are the steps in between? Be sure to break the problem down into smaller steps. In words, not code. </li>
<li> Walk through one execution in your head or on paper - are any steps missing, ambiguous, or actually made up of more steps and need more breaking down? </li>
<li> Along the way, does that need any "additions" - new variables, pulling input from somewhere else, etc.? </li>
<li> Start to write the code. Each step should translate into a line or two of code, since you have already broken it down. </li>
<li> If you don't know how to do a step, it is now one specific thing, look for it in documentation or google "python [thing you need to do]" </li>
<li> Place print statements throughout the code that print out the values of variables as they are being changed, so you can see what is happening, and check for correctness as you go. </li>
<li> Test the code with a few different inputs to make sure it works. </li>
<li> Bob's your uncle! </li>
</ul>
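
As a rough illustration of that workflow, here is a minimal sketch on a made-up problem (the problem, the function name `average_word_length`, and the `debug` flag are all hypothetical, not part of the exercises). It shows the steps written as comments first, print statements to watch values change, an edge case caught during the walk-through, and a few test inputs.

```python
# Hypothetical practice problem: return the average word length in a sentence.
# Input: a string. Output: a float.

def average_word_length(sentence, debug=False):
    # Step 1: split the sentence into words
    words = sentence.split()
    if debug:
        print("words:", words)

    # Step found during the walk-through: an empty sentence has no words,
    # so decide up front what to return instead of dividing by zero
    if not words:
        return 0.0

    # Step 2: measure each word
    lengths = [len(word) for word in words]
    if debug:
        print("lengths:", lengths)

    # Step 3: average the lengths
    return sum(lengths) / len(lengths)


# Try a few different inputs, including an edge case
print(average_word_length("the quick brown fox", debug=True))
print(average_word_length("hello"))
print(average_word_length(""))
```

Once the function behaves, the print statements can be switched off with `debug=False` rather than deleted, which makes it easy to turn them back on later.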

For logistics, feel free to chat in the chat, ask for breakout rooms, talk on the line, etc... Talking things out is useful. 

## Loading Data

You might want to load some data to test with. The !wget command above downloads a file from the data folder of this repository. There are many data sources in that folder; you can browse them at https://github.com/AkeemSemper/ML_for_Non_DS_Students/tree/main/data. To download a different file, copy its URL into that command. Note that you need the <b>raw</b> URL: after you click on a file name, the GitHub interface has a Raw button that takes you to it. Google "github raw url" if you need more details.
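
If `wget` is not installed on your machine (the first cell's output shows that error), the same download can be done from Python. The sketch below uses only the standard library; the helper names `to_raw_url` and `download_if_missing` are made up for illustration, and the file-page URL is an assumption built from the raw URL used in the first cell.

```python
import os
import urllib.request


def to_raw_url(github_url):
    """Convert a github.com file-page URL into its raw.githubusercontent.com form."""
    return (github_url
            .replace("https://github.com/", "https://raw.githubusercontent.com/", 1)
            .replace("/blob/", "/", 1))


def download_if_missing(url, file_path):
    """Download url to file_path unless the file already exists (like wget -nc)."""
    if not os.path.exists(file_path):
        urllib.request.urlretrieve(url, file_path)
    return file_path


# Assumed file-page URL, built from the raw URL in the first cell
page_url = "https://github.com/AkeemSemper/ML_for_Non_DS_Students/blob/main/data/titanic_train.csv"
print(to_raw_url(page_url))

# Uncomment to download (needs an internet connection):
# FILE_PATH = download_if_missing(to_raw_url(page_url), "titanic_train.csv")
```

Another option is to skip the download entirely: `pd.read_csv` also accepts a URL in place of a local file path.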

## Question 1

Create a function that takes in:
<ul>
<li> A list of full names that can be in either of the formats: "First Last" or "Last, First" </li>
</ul>

The function should return:
<ul>
<li> A tuple containing four items: </li>
    <ul>
    <li> A list of the first names </li>
    <li> A list of the last names </li>
    <li> A list of the full names in the "First Last" format </li>
    <li> A list of the full names in the "Last, First" format </li>
    </ul>
</ul>

The order should be maintained for the lists. 

In [27]:
sample_input_list = ["John Smith", "Jane Doe", "Smith, Janice", "Doe, John", "Timmons, Tom", "Bob Johnson"]

In [28]:
def process_names(names):
    first_names = []
    last_names = []
    full_names_first_last = []
    full_names_last_first = []
    
    for name in names:
        if ',' in name:
            last, first = name.split(', ')
            first_names.append(first)
            last_names.append(last)
            full_names_first_last.append(first + ' ' + last)
            full_names_last_first.append(name)
        else:
            first, last = name.split(' ')
            first_names.append(first)
            last_names.append(last)
            full_names_first_last.append(name)
            full_names_last_first.append(last + ', ' + first)
    
    return (first_names, last_names, full_names_first_last, full_names_last_first)


In [29]:
process_names(sample_input_list)

(['John', 'Jane', 'Janice', 'John', 'Tom', 'Bob'],
 ['Smith', 'Doe', 'Smith', 'Doe', 'Timmons', 'Johnson'],
 ['John Smith',
  'Jane Doe',
  'Janice Smith',
  'John Doe',
  'Tom Timmons',
  'Bob Johnson'],
 ['Smith, John',
  'Doe, Jane',
  'Smith, Janice',
  'Doe, John',
  'Timmons, Tom',
  'Johnson, Bob'])
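
The solution above works for the sample list, but it assumes exactly one space in "First Last" names and exactly ", " in "Last, First" names, so an input like "Mary Ann Smith" or "Smith,Janice" would raise an error. One possible way to loosen that is sketched below. The helper name `split_name` is hypothetical, and treating everything before the final space as the first name is an assumption, not a rule from the question.

```python
def split_name(name):
    """Return (first, last) for one name in "First Last" or "Last, First" form."""
    if "," in name:
        # Split on the first comma only, then trim whitespace on both sides
        last, first = (part.strip() for part in name.split(",", 1))
    else:
        # Split on the last space, so "Mary Ann Smith" -> ("Mary Ann", "Smith")
        first, last = name.strip().rsplit(" ", 1)
    return first, last


for sample in ["John Smith", "Smith,Janice", "Mary Ann Smith"]:
    print(split_name(sample))
```

A single-word name such as "Cher" would still fail here, so that case needs its own decision if your data contains it.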

## Question 2

Create a function that takes in:
<ul>
<li> A dataframe </li>
<li> An aggregation (i.e. sum, count, mean, median, standard deviation) </li>
</ul>

The function should print:
<ul>
<li> The aggregation of each <i>numeric</i> column in the dataframe </li>
<li> The value counts of each <i>non-numeric</i> column in the dataframe </li>
</ul>

You will need to work out how to identify numeric and non-numeric columns, and you can add extra arguments if you need or want them. (You could make the detection automatic when the columns are not specified and use the provided columns when they are, but we haven't covered that yet - challenge problem!) Remember, this needs to be generic: any dataframe should work, with any number and names of numeric and non-numeric columns. You'll also want to test this, so either pull in or create some data to test with. If you're creating data, it is possible to generate random values automatically; google that if needed. I personally would use an existing dataset for testing.
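
If you do want to create test data, one possible approach is sketched below using NumPy's random generator. The column names, value ranges, and categories are invented for illustration. It also shows one way to tell numeric and non-numeric columns apart, using `select_dtypes`.

```python
import numpy as np
import pandas as pd

# Seeded generator so the "random" data is the same on every run
rng = np.random.default_rng(42)
n_rows = 20

# Column names and value ranges are made up for illustration
test_df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n_rows),
    "income": rng.normal(50_000, 15_000, size=n_rows),
    "colour": rng.choice(["red", "green", "blue"], size=n_rows),
    "member": rng.choice(["yes", "no"], size=n_rows),
})

# Split the column names by type
numeric_cols = test_df.select_dtypes(include="number").columns.tolist()
other_cols = test_df.select_dtypes(exclude="number").columns.tolist()
print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
```

Because the generator is seeded, a failing test can be reproduced exactly; change the seed to get a different dataset.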

In [30]:
# Code
def aggregate_dataframe(df, aggregation):
    numeric_columns = df.select_dtypes(include='number').columns
    non_numeric_columns = df.select_dtypes(exclude='number').columns
    
    if aggregation in ['sum', 'count', 'mean', 'median', 'std']:
        print("Aggregation of numeric columns:")
        print(df[numeric_columns].agg(aggregation))
    else:
        print("Invalid aggregation method.")
    
    print("\nValue counts of non-numeric columns:")
    for column in non_numeric_columns:
        print(df[column].value_counts())


In [31]:
# Code
df = pd.read_csv(FILE_PATH)
aggregate_dataframe(df, 'mean')

Aggregation of numeric columns:
PassengerId    446.000000
Survived         0.383838
Pclass           2.308642
Age             29.699118
SibSp            0.523008
Parch            0.381594
Fare            32.204208
dtype: float64

Value counts of non-numeric columns:
Braund, Mr. Owen Harris                     1
Boulos, Mr. Hanna                           1
Frolicher-Stehli, Mr. Maxmillian            1
Gilinski, Mr. Eliezer                       1
Murdlin, Mr. Joseph                         1
                                           ..
Kelly, Miss. Anna Katherine "Annie Kate"    1
McCoy, Mr. Bernard                          1
Johnson, Mr. William Cahoone Jr             1
Keane, Miss. Nora A                         1
Dooley, Mr. Patrick                         1
Name: Name, Length: 891, dtype: int64
male      577
female    314
Name: Sex, dtype: int64
347082      7
CA. 2343    7
1601        7
3101295     6
CA 2144     6
           ..
9234        1
19988       1
2693        1
PC 17612   

## Question 3

Create a function that takes in:
<ul>
<li> A dataframe </li>
<li> An operation - "remove", "median", "mean", "mode" </li>
</ul>

The function should return:
<ul>
<li> The same dataframe, with the operation applied to <b>any missing values</b> in it. </li>
<li> The calculations should be done on a column by column basis. </li>
<li> Think about what to do, logically, with categorical columns. </li>
</ul>

Again, you'll need some data to test with. If your data doesn't have many missing values, you may want to generate some blanks in it - this can be done randomly, and is a good interim step to search for and build. You can also look on sites like Kaggle or Google Dataset Search for datasets that have missing values, or upload a file into Colab and use it; Google "upload file to colab" for step-by-step instructions.
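
For the "generate some blank spaces" step, here is a minimal sketch of one way to do it. The function name `add_random_missing` and its parameters are hypothetical. It draws one random number per cell and blanks out the cells that fall below the requested fraction.

```python
import numpy as np
import pandas as pd


def add_random_missing(df, fraction=0.1, seed=0):
    """Return a copy of df with roughly `fraction` of its cells set to missing."""
    rng = np.random.default_rng(seed)
    # One random number per cell; True marks a cell to blank out
    hide = rng.random(df.shape) < fraction
    # mask() replaces the cells where the condition is True with a missing value
    return df.mask(hide)


complete = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0, 190.0, 200.0],
    "colour": ["red", "blue", "red", "green", "blue", "red"],
})
holey = add_random_missing(complete, fraction=0.3, seed=1)
print(holey)
print(holey.isna().sum())
```

The original dataframe is left unchanged, so you can compare the filled-in result against the true values afterwards.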

In [32]:
# Code
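def handle_missing_values(df, operation):
    if operation == "remove":
        # Drop every row that has at least one missing value
        df = df.dropna()
    elif operation == "median":
        # numeric_only=True skips text columns, which have no median and
        # raise an error in recent pandas versions; they stay unfilled
        df = df.fillna(df.median(numeric_only=True))
    elif operation == "mean":
        # Same idea: only numeric columns get a mean
        df = df.fillna(df.mean(numeric_only=True))
    elif operation == "mode":
        # mode() can return several rows on ties; use the first one
        df = df.fillna(df.mode().iloc[0])
    else:
        print("Invalid operation.")
    
    return df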
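
The question also asks what to do, logically, with categorical columns. One possible answer, sketched below, is to loop over the columns and fall back to the most common value whenever a column is not numeric. The function name `fill_missing_by_column` and the fallback rule are assumptions for illustration, not the only reasonable choice.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def fill_missing_by_column(df, operation):
    """Fill missing values one column at a time; text columns always use the mode."""
    if operation == "remove":
        return df.dropna()
    if operation not in ("median", "mean", "mode"):
        raise ValueError(f"Invalid operation: {operation}")

    df = df.copy()  # leave the caller's dataframe untouched
    for column in df.columns:
        if operation in ("median", "mean") and is_numeric_dtype(df[column]):
            fill_value = df[column].agg(operation)
        else:
            # Assumption: a mean or median is meaningless for text,
            # so categorical columns fall back to the most common value
            modes = df[column].mode()
            if modes.empty:
                continue  # column is entirely missing; nothing to fill with
            fill_value = modes.iloc[0]
        df[column] = df[column].fillna(fill_value)
    return df


demo = pd.DataFrame({
    "score": [1.0, None, 3.0, 8.0],
    "grade": ["a", "b", None, "b"],
})
print(fill_missing_by_column(demo, "mean"))
```

Raising an error for an unknown operation, rather than printing a message and returning the data unchanged, makes a typo in the operation name harder to miss.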


## Question 4

Create a function that takes in:
<ul>
<li> A list of strings </li>
</ul>

The function should return:
<ul>
<li> A list of the strings that are palindromes </li>
</ul>

In [33]:
# Code
def find_palindromes(strings):
    palindromes = []
    for string in strings:
        if string == string[::-1]:
            palindromes.append(string)
    return palindromes


In [34]:
sample_list_palin = ["radar", "madam", "hello", "world", "level", "python", "deified", "west", "east", "north", "south"]

In [35]:
find_palindromes(sample_list_palin)

['radar', 'madam', 'level', 'deified']
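
The check above compares each string exactly, so "Radar" or a phrase with spaces and punctuation would not count. If you want those to match, one option is to normalize the text first. This is a sketch under that assumption; the question itself does not say whether case and punctuation should be ignored, and the name `is_palindrome` is made up here.

```python
def is_palindrome(text):
    """True if text reads the same both ways, ignoring case, spaces, and punctuation."""
    # Keep only letters and digits, lowercased
    cleaned = [char.lower() for char in text if char.isalnum()]
    return cleaned == cleaned[::-1]


phrases = ["Radar", "A man, a plan, a canal: Panama", "Hello", "No lemon, no melon"]
print([phrase for phrase in phrases if is_palindrome(phrase)])
```

Note that an empty string, or one made only of punctuation, counts as a palindrome under this rule; filter those out first if that is not what you want.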

## Question 5

Create a function that takes in:
<ul>
<li> A dataframe </li>
<li> An "outlier limit" of standard deviations </li>
<li> A "count limit" for categorical columns </li>
</ul>

The function should return:
<ul>
<li> A dataframe with the outliers removed - i.e. anything that is more than X standard deviations higher or lower than the mean is removed.</li>
<li> Categorical values that appear fewer than X times removed. </li>
<li> The outliers should be removed on a column by column calculation basis. </li>
</ul>

This will need some data, either real or generated. If you find this easy, try to modify it to handle a list of specifications, so the user can provide the limits individually for each column.

In [36]:
import numpy as np
# Code
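def remove_outliers(df, outlier_limit, count_limit):
    # Remove rows that are outliers in any numeric column.
    # Each column's mean and std are computed on the rows that survived
    # the previous columns, so the column order affects the result.
    numeric_columns = df.select_dtypes(include=np.number).columns
    for column in numeric_columns:
        mean = df[column].mean()
        std = df[column].std()
        lower_limit = mean - outlier_limit * std
        upper_limit = mean + outlier_limit * std
        # Comparisons with a missing value are False, so rows with a
        # missing value in this column are dropped as well
        df = df[(df[column] >= lower_limit) & (df[column] <= upper_limit)]
    
    # Remove rows whose categorical value appears fewer than count_limit times
    # (rows with a missing value in the column are dropped too)
    categorical_columns = df.select_dtypes(exclude=np.number).columns
    for column in categorical_columns:
        counts = df[column].value_counts()
        df = df[df[column].isin(counts[counts >= count_limit].index)]
    
    return df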


In [37]:
remove_outliers(df, 3, 10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
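
The output above shows only the column header, which suggests that no rows survived. That is what you would expect on this dataset: every Name appears exactly once (see the value counts in Question 2), so a count limit of 10 filters out every row. A possible variant is sketched below. It builds one keep/drop mask from the original data, keeps missing values instead of dropping them, and lets the caller skip identifier-like columns. The function name `remove_outliers_by_mask` and the `skip_columns` parameter are hypothetical additions, not part of the question.

```python
import pandas as pd


def remove_outliers_by_mask(df, outlier_limit, count_limit, skip_columns=()):
    """Variant that computes every limit on the original data and keeps missing values."""
    keep = pd.Series(True, index=df.index)

    for column in df.select_dtypes(include="number").columns:
        if column in skip_columns:
            continue
        mean, std = df[column].mean(), df[column].std()
        in_range = df[column].between(mean - outlier_limit * std,
                                      mean + outlier_limit * std)
        # Missing values are not outliers, so they are kept
        keep &= in_range | df[column].isna()

    for column in df.select_dtypes(exclude="number").columns:
        if column in skip_columns:
            continue
        counts = df[column].value_counts()
        frequent = df[column].isin(counts[counts >= count_limit].index)
        keep &= frequent | df[column].isna()

    return df[keep]


# Small made-up dataset: the last row is both an outlier and a rare category
demo = pd.DataFrame({
    "value": [10.0] * 11 + [100.0],
    "group": ["a"] * 6 + ["b"] * 5 + ["rare"],
    "label": [f"row{i}" for i in range(12)],  # unique per row, like Name
})
cleaned = remove_outliers_by_mask(demo, 3, 2, skip_columns=["label"])
print(len(demo), "->", len(cleaned))
```

Without `skip_columns`, the unique `label` column would remove every row, which mirrors the empty Titanic result.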
