# Python/Pandas Assessment

### Magdalena RAHN, 18 November 2022  


### Setup
- Clone the repository containing this file to your `~/codeup-data-science/` folder.
- You'll notice that all files are gitignored. This is done to ensure that your work _does not_ get pushed to GitHub. Sharing test questions/answers is an academic integrity issue, so we need to avoid that issue entirely. Avoid adding this repo to GitHub.
- Upload your completed notebook to the appropriate Google Classroom assignment.

### Orientation
- There are 10 exercises on this assessment worth 10 points each.
- Credit is given for programmatic solutions only; your code shows your work. Because the expected answer is visible in the unit test code, a function that simply hard-codes it (`return 44`, for example) will not earn credit.
- Your Python/pandas code should run without errors.
- After each problem prompt, there is a cell to write your code, followed by another cell with a unit test.

### Troubleshooting
If you need a fresh start, go to Kernel and then select "Restart and Clear Output" in this Jupyter Notebook.

In [1]:
# Required Imports and data acquisition
import pandas as pd
from pydataset import data

df = data("tips")
df.head()

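|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 3 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 4 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 5 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |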


####  EXAMPLE: Write a function named `exercise0`
- This function should accept a dataframe as its input argument
- Notice that the example function is returning the appropriate, programmatic code to obtain the solution
- The `assert` line checks the exercise solution code to ensure correctness

In [None]:
# This example function is solved below:
def exercise0(df):
    return len(df)

assert exercise0(df) == 244
print("Exercise 0 example exercise is complete.")

####  Write a function named `exercise1`
- Use the cell below to write your code
- This function should accept a dataframe as its input argument
- This function should return the highest `total_bill` value from the tips dataframe

In [16]:
# Write your code for the exercise1 function here

def exercise1(df):
    # .max() returns a scalar; .nlargest(1) would return a one-row Series
    return df['total_bill'].max()

exercise1(df)

50.81

In [15]:
assert exercise1(df) == 50.81
print("Exercise 1 is complete")

Exercise 1 is complete
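
The assert for this exercise compares the function's return value to a single number, so the function needs to return a scalar rather than a Series. The sketch below uses a small made-up Series named `bills` (an assumption for illustration, not the tips data) to contrast `nlargest(1)`, which returns a one-row Series, with `max()`, which returns a scalar.

```python
import pandas as pd

# Hypothetical bills, used only for illustration
bills = pd.Series([16.99, 50.81, 10.34], name="total_bill")

top_as_series = bills.nlargest(1)  # one-row Series that keeps the index label
top_as_scalar = bills.max()        # a single scalar value

print(type(top_as_series).__name__)  # Series
print(top_as_scalar)                 # 50.81

# A scalar compares cleanly inside an assert
assert top_as_scalar == 50.81

# A one-row Series can be reduced to a scalar explicitly when needed
assert top_as_series.iloc[0] == 50.81
```

Comparing the Series itself with `==` yields another Series of booleans, which `assert` cannot interpret as a single true or false value.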

####  Write a function named `exercise2`
- Use the cell below to write your code
- This function should return the number of different days in the `day` column.
- This function should accept a dataframe as its input argument.

In [18]:
# Write your code for the exercise2 function definition here


def exercise2(df):
    return df['day'].nunique()

exercise2(df)

4

In [None]:
assert exercise2(df) == 4
print("Exercise 2 is complete")

####  Write a function named `exercise3`
- Use the cell below to write your `exercise3` function definition
- This function should return the number of rows that represent "Lunch" time tables
- A "table" in this dataset is a single row, representing one bill, _not_ the number of people at that table
- This function should accept a dataframe as its input argument

In [22]:
# Write your code for the exercise3 function here

def exercise3(df):
    return sum(df['time'] == 'Lunch')

exercise3(df)

68

In [None]:
assert exercise3(df) == 68
print("Exercise 3 is correct")

####  Exercise 4 is one line of pandas code, not a function
- Use the cell below to write the code necessary to rename the `size` column to `table_size` on the `df` variable.
- Remember that `.size` is already a built-in attribute of pandas DataFrames, so it helps to rename columns that share a name with a built-in attribute
- Exercise 4 code is not a function, but should be 1 line of pandas code. 
- Be certain to update the `df` variable or mutate it accordingly, so that `df` has the new column name.
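
To illustrate why the column name matters, here is a small sketch on a made-up two-row frame (the name `toy` and its values are assumptions, not the tips data). It shows that attribute access with `.size` resolves to the built-in DataFrame attribute, while the column remains reachable only through brackets until it is renamed.

```python
import pandas as pd

# Hypothetical two-row frame with a column literally named "size"
toy = pd.DataFrame({"total_bill": [16.99, 10.34], "size": [2, 3]})

# Attribute access resolves to the built-in DataFrame.size (rows * columns)
print(toy.size)  # 4

# Bracket access still reaches the column
print(toy["size"].tolist())  # [2, 3]

# rename() returns a new frame, so the result has to be assigned somewhere
renamed = toy.rename(columns={"size": "table_size"})
print(renamed.table_size.tolist())  # [2, 3]
print(toy.columns.tolist())         # ['total_bill', 'size'] (original unchanged)
```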

In [25]:
# Write your pandas code to rename the "size" column to "table_size"

df = df.rename(columns={'size': 'table_size'})
df.columns

Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'table_size'], dtype='object')

In [None]:
assert df.columns.tolist() == ['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'table_size']
print("Exercise 4 is complete")

#### Write a function named `exercise5`
- This function should return the proportion of lunch tables out of all tables
- A "table" in this dataset is a single row, representing one bill, _not_ the number of people at that table
- You can use the full decimal or choose to round to 2 decimal places. Either answer will earn credit 
- This function should accept a dataframe as its input argument

In [34]:
lunch_time = sum(df['time'] == 'Lunch')
lunch_time

68

In [43]:
all_time = len(df['time'])
all_time

244

In [44]:
lunch_time / all_time

0.2786885245901639
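
The same proportion can be computed from a dataframe alone, without intermediate notebook variables, because the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a made-up four-row frame (the name `toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame: one lunch table out of four
toy = pd.DataFrame({"time": ["Lunch", "Dinner", "Dinner", "Dinner"]})

is_lunch = toy["time"] == "Lunch"  # boolean mask, one entry per row

count_ratio = is_lunch.sum() / len(toy)  # matching rows divided by all rows
mask_mean = is_lunch.mean()              # True counts as 1, False as 0

print(count_ratio, mask_mean)  # 0.25 0.25
```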

In [48]:
# Exercise 5 code here

def exercise5(df):
    # Compute the proportion from the dataframe passed in, not from notebook globals
    return (df['time'] == 'Lunch').sum() / len(df)


exercise5(df)


0.2786885245901639

In [None]:
assert exercise5(df) in [0.2786885245901639, 0.28]
print("Exercise 5 is correct")

#### Exercise 6
- Write a function named `exercise6`
- This function should return the number of rows where the `total_bill` is greater than the average of all `total_bill` values.
- This function should accept a dataframe as its input argument

In [55]:
all_bills = df['total_bill'].sum()
all_bills

4827.77

In [60]:
avg_all = round(all_bills / len(df), 2)
avg_all


19.79
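
Rounding the average before comparing can shift the count when a bill falls at or between the exact mean and the rounded one. A small sketch with made-up values (the series `bills` is hypothetical, not the tips data) shows the difference:

```python
import pandas as pd

# Hypothetical bills chosen so that rounding the mean changes the result
bills = pd.Series([1.00, 1.01, 1.01])

exact_mean = bills.mean()            # about 1.0067
rounded_mean = round(exact_mean, 2)  # 1.01

above_exact = int((bills > exact_mean).sum())      # both 1.01 rows qualify
above_rounded = int((bills > rounded_mean).sum())  # no row is strictly above 1.01

print(above_exact, above_rounded)  # 2 0
```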

In [70]:
# df['total_bill' > avg_all]

sum(df['total_bill'] > avg_all)

99

In [71]:
# Exercise 6 code here

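def exercise6(df):
    # Compare against the exact mean of the dataframe passed in, not a rounded global
    return (df['total_bill'] > df['total_bill'].mean()).sum()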

exercise6(df)


99

In [None]:
assert exercise6(df) == 99
print("Exercise 6 is correct")

#### Exercise 7
- Write a function named `exercise7`
- This function should return the highest `total_bill` value for Thursday dinner tables (each row is a table).
- This function should accept a dataframe as its input argument

In [255]:
plus_grande_facture = df['total_bill'].nlargest()
plus_grande_facture

171    50.81
213    48.33
60     48.27
157    48.17
183    45.35
Name: total_bill, dtype: float64

In [258]:
essayons = df[df['day'].str.startswith('S')] & df[('plus_grande_facture')]
essayons

KeyError: 'plus_grande_facture'

In [259]:
encoreunessai = Thur_dinner & plus_grande_facture
encoreunessai

TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]

In [88]:
Thur_dinner = sum(df['day'] == 'Thur') & sum(df['time'] == 'Dinner')
Thur_dinner

48
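
The `48` above is the bitwise AND of two integer counts, which is different from counting the rows where both conditions hold. A sketch on a made-up frame (the name `toy` and its rows are assumptions) contrasts the two. Each comparison is wrapped in parentheses because `&` binds more tightly than `==`.

```python
import pandas as pd

# Hypothetical rows: exactly one Thursday dinner table
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Thur", "Fri", "Sat"],
    "time": ["Lunch", "Lunch", "Dinner", "Dinner", "Dinner"],
    "total_bill": [12.50, 30.00, 18.78, 22.10, 40.00],
})

thur_count = int((toy["day"] == "Thur").sum())       # 3
dinner_count = int((toy["time"] == "Dinner").sum())  # 3
bitwise = thur_count & dinner_count  # bitwise AND of two integers, not a row count
print(bitwise)  # 3

# Combine the boolean masks row by row, then use the result to filter
both = (toy["day"] == "Thur") & (toy["time"] == "Dinner")
print(int(both.sum()))                     # 1
print(toy.loc[both, "total_bill"].max())  # 18.78
```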

In [112]:
# df['total_bill'] == Thur_dinner

# df('Thurs_dinner')['total_bill'].str.nlargest()

1      False
2      False
3      False
4      False
5      False
       ...  
240    False
241    False
242    False
243    False
244    False
Name: total_bill, Length: 244, dtype: bool

In [200]:
sum(df['day'] == 'Thur') & sum(df['time'] == 'Dinner')


48

In [204]:
df['total_bill'].nlargest() & df[(df['day'] == 'Thur') & (df['time'] == 'Dinner')]

TypeError: unsupported operand type(s) for &: 'bool' and 'str'

In [139]:
djfj = df[df['time'] == 'Thur'].sort_values('total_bill').head()
djfj


|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|


In [188]:
fjj = int[('total_bill')]
fjj

TypeError: 'type' object is not subscriptable

In [None]:
# Exercise 7 code here

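def exercise7(df):
    # Keep only Thursday dinner rows, then take the largest total_bill
    thur_dinner = df[(df['day'] == 'Thur') & (df['time'] == 'Dinner')]
    return thur_dinner['total_bill'].max()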



In [None]:
assert exercise7(df) == 18.78
print("Exercise 7 is correct")

#### Exercise 8
- Write a function named `exercise8`
- This function should return the highest `total_bill` for tables on Thursday or Friday
- This function should accept a dataframe as its input argument

In [182]:
dftf = sum(df['day'].isin(['Thur' or 'Fri']))['total_bill'].max()
dftf

TypeError: 'int' object is not subscriptable
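
Two things contribute to the error above. `sum(...)` collapses the boolean mask into a single integer, which cannot be indexed with `['total_bill']`, and `'Thur' or 'Fri'` evaluates to just `'Thur'`. A sketch on a made-up frame (the name `toy` and its values are assumptions) shows both points, with the mask used directly to filter rows:

```python
import pandas as pd

# In Python, `or` returns its first truthy operand, so this list holds one item
days = ['Thur' or 'Fri']
print(days)  # ['Thur']

# Hypothetical rows for illustration
toy = pd.DataFrame({
    "day": ["Thur", "Fri", "Sat", "Fri"],
    "total_bill": [20.00, 43.11, 50.81, 12.00],
})

# isin() takes the full list of accepted values and returns a boolean mask
mask = toy["day"].isin(["Thur", "Fri"])

# Filter with the mask itself (not its sum), then reduce the column
thur_or_fri = toy[mask]
print(thur_or_fri["total_bill"].max())  # 43.11
```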

In [None]:
# Exercise 8 code here

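def exercise8(df):
    # Keep rows whose day is Thursday or Friday, then take the largest total_bill
    return df[df['day'].isin(['Thur', 'Fri'])]['total_bill'].max()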


In [None]:
assert exercise8(df) == 43.11
print("Exercise 8 is correct")

#### Exercise 9
- Write a function named `exercise9`
- This function should return the average `total_bill` for tables dining on a Saturday or Sunday
- This function should accept a dataframe as its input argument

In [175]:
df['total_bill'].max()

50.81

In [177]:
df[(df['day'].isin(['Sat','Sun']))].head() 


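|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 3 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 4 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 5 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |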


In [173]:
(df['total_bill'].isin(['Sat','Sun'])).mean()

0.0
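
The `0.0` here comes from testing the numeric `total_bill` column against day names: no value matches, and the mean of an all-`False` mask is zero. A sketch of filtering on `day` first and then averaging `total_bill`, using a made-up frame (`toy` and its values are assumptions):

```python
import pandas as pd

# Hypothetical rows: three weekend tables and one Thursday table
toy = pd.DataFrame({
    "day": ["Sat", "Sun", "Thur", "Sat"],
    "total_bill": [10.0, 20.0, 99.0, 30.0],
})

# Checking the bill amounts against day names matches nothing
wrong = toy["total_bill"].isin(["Sat", "Sun"]).mean()
print(wrong)  # 0.0

# Filter rows by day, then average the bill column of the remaining rows
weekend = toy[toy["day"].isin(["Sat", "Sun"])]
print(weekend["total_bill"].mean())  # 20.0
```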

In [None]:
df[(df['day'].isin(['Sat','Sun']))].head()  & df['total_bill'].max()

# error msg


In [174]:
# Exercise 9 code here:

def exercise9(df):
    # Hard-coding `return 20.9` would pass the assert but is not a programmatic solution.
    # Keep Saturday and Sunday rows, then average their total_bill values.
    return df[df['day'].isin(['Sat', 'Sun'])]['total_bill'].mean()

In [None]:
assert exercise9(df) in [20.89300613496933, 20.9]
print("Exercise 9 is correct")

#### Exercise 10
- Write a function named `exercise10`
- This function should take in the `prices` series as its input argument.
- This function should clean these strings, which contain dollar signs and commas, and convert them into proper floats.
- The `exercise10` function should return a series containing only floats

In [140]:
prices = pd.Series(["$1,234.56", "$2,345,678.99", "$123.45", "$3,333,333.99"])

In [153]:
p1 = prices.str.replace('$', '', regex=False)
p1

0        1,234.56
1    2,345,678.99
2          123.45
3    3,333,333.99
dtype: object
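
Because `$` is also a regular-expression anchor (end of string), it can help to state explicitly that the replacement is literal. A minimal sketch, assuming a made-up series named `raw`; `regex=False` is an existing parameter of `Series.str.replace`, while the helper name `to_float` is hypothetical:

```python
import pandas as pd

def to_float(s: pd.Series) -> pd.Series:
    """Strip literal '$' and ',' characters, then convert to float."""
    cleaned = (
        s.str.replace("$", "", regex=False)   # literal dollar sign, not a regex anchor
         .str.replace(",", "", regex=False)   # thousands separators
    )
    return cleaned.astype(float)

# Hypothetical input, not the `prices` series from the exercise
raw = pd.Series(["$1,000.50", "$20.00"])
print(to_float(raw).tolist())  # [1000.5, 20.0]
```

Keeping the cleaning inside one function that works only on its argument means the result follows whatever series is passed in.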

In [154]:
p2 = p1.str.replace(',','')
p2

0       1234.56
1    2345678.99
2        123.45
3    3333333.99
dtype: object

In [155]:
p3 = pd.to_numeric(p2)
p3

0       1234.56
1    2345678.99
2        123.45
3    3333333.99
dtype: float64

In [156]:
# Write your function definition for exercise10 here

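def exercise10(prices):
    # Strip literal '$' and ',' from the series passed in, then convert to floats
    cleaned = prices.str.replace('$', '', regex=False).str.replace(',', '', regex=False)
    return pd.to_numeric(cleaned)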

exercise10(prices)


0       1234.56
1    2345678.99
2        123.45
3    3333333.99
dtype: float64

In [None]:
assert exercise10(prices).values.tolist() == [1234.56, 2345678.99, 123.45, 3333333.99]
print("Exercise 10 is correct.")