In [5]:
# Initialize Otter
import otter
grader = otter.Notebook("lab.ipynb")

# Lab 3 – Hypothesis Testing and DataFrame Manipulation

## DSC 80, Spring 2022

### Due Date: Monday, April 18th at 11:59 PM

## Instructions
Much like in DSC 10, this Jupyter Notebook contains the statements of the problems and provides code and Markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding will be done in an accompanying `lab.py` file that is imported into the current notebook.

Labs and programming assignments will be graded in (at most) two ways:
1. The functions and classes in the accompanying `.py` file will be tested (a la DSC 20),
2. The notebook may be graded (if it contains free response questions or asks you to draw plots).

**Note**: Labs will have public tests and private tests. The public "smoke tests" that you will run below and which appear on Gradescope are generally worth no points. After the due date, we will replace these tests with private tests that will determine your grade. This is different from DSC 10, where labs only had public tests!

**Do not change the function names in the `*.py` file!**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name.
- If you changed something you weren't supposed to, just use git to revert! Ask us if you need help with this, or google around for `git revert`.

**Tips for working in the notebook**:
- The notebooks serve to present the questions and give you a place to present your results for later review.
- The notebooks in *lab assignments* are not graded (only the `.py` file is submitted and graded).
- The notebook serves as a nice environment for 'pre-development' and experimentation before designing your function in your `.py` file. You can write code here, but make sure that all of your real work is in the `.py` file.

**Tips for developing in the `.py` file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional helper functions to solve the lab! 
- Always document your code!

### Importing code from `lab.py`

* We import our `.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab.py` in the notebook.
    - `autoreload` is necessary because, upon import, `lab.py` is compiled to bytecode (in the directory `__pycache__`). Subsequent imports of `lab` merely import the existing compiled python.

In [6]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [7]:
from lab import *

In [8]:
import os
import io
import pandas as pd
import numpy as np

## Part 1: Hypothesis Testing

In this section we'll develop an intuition for the terms and structure of hypothesis testing – it's nothing to be afraid of!

The first step is always to define what you're looking at, create your hypotheses, and set a level of significance.  Once you've done that, you can find a p-value which is related to your test statistic.

If all of these words are scary: look at the [Lecture 6](https://dsc80.com/resources/lectures/lec06/lec06.html) notebook, the readings, and don't forget to think about the real-world meaning of these terms!  The following example describes a real-world scenario, so you can think of it in a normal lens.

### Question 1 – Tires 🚗

A tire manufacturer, TritonTire, claims that their tires are so good, they will bring a Toyota Highlander from 60 mph to a complete stop in under 106 feet, 97% percent of the time.

Now, you own a Toyota Highlander equipped with TritonTire tires, and you decide to test this claim. You take your car to an empty Vons parking lot, speed up to exactly 60 mph, hit the brakes, and measure the stopping distance. As illegal as it is, you repeat this process 50 times and find that **you stopped in under 106 feet only 47 of the 50 times**.

Livid, you call TritonTire and say that their claim is false. They say, no, that you were just unlucky: your experiment is consistent with their claim. But they didn't realize that they are dealing with a *data scientist* 🧑‍🔬.

To settle the matter, you decide to unleash the power of the hypothesis test. The following three subparts ask you to answer a total of four select-all multiple choice questions.

#### Question 1.1

You will set up a hypothesis test in order to test your suspicion that the tires are are actually worse than claimed. Which of the following are valid null and alternative hypotheses for this hypothesis test?

1. The tires will stop your car in under 106 feet exactly 97% of the time.
0. The tires will stop your car in under 106 feet less than 97% of the time.
0. The tires will stop your car in under 106 feet greater than 97% of the time.
0. The tires will stop your car in more than 106 feet exactly 3% of the time.
0. The tires will stop your car in more than 106 feet less than 3% of the time.
0. The tires will stop your car in more than 106 feet greater than 3% of the time.

Create a function called `car_null_hypoth` which takes zero arguments and returns a list of integers, corresponding to the the valid null hypotheses above.
Also create a function called `car_alt_hypoth` which takes zero arguments and returns a list of integers, corresponding to the valid alternative hypotheses above.

<br>

#### Question 1.2

Which of the following are valid test statistics for our question?

1. The number of times the car stopped in under 106 feet in 50 attempts.
1. The average number of feet the car took to come to a complete stop in 50 attempts.
1. The number of attempts it took before the car stopped in under 95 feet.
1. The proportion of attempts in which the car successfully stopped in under 106 feet.

Create a function called `car_test_stat` which takes zero arguments and returns a list of integers, corresponding to the valid test statistics above.

<br>

#### Question 1.3

The p-value is the probability, under the assumption the null hypothesis is true, of observing a test statistic **equal to our observed statistic, or more extreme in the direction of the alternative hypothesis**.

Why don't we just look at the probability of observing a test statistic equal to our observed statistic? That is, why is the "more extreme in the direction of the alternative hypothesis" part necessary?

1. Because our observed test statistic isn't extreme.
4. Because our null hypothesis isn't suggesting equality.
5. Because our alternative hypothesis isn't suggesting equality.
2. Because the probability of finding our observed test statistic equals the probability of finding something more extreme.
3. Because if we run more and more trials (where a trial is speeding up the car then stopping), the probability of finding *any* observed test statistic gets closer and closer to zero, so if we did this we would always reject the null with more trials even if the null is true.


Create a function `car_p_value` which takes zero arguments and returns the correct reason as an integer (not a list).

In [9]:
grader.check("q1")

## Part 2: Grouping

Last month, the UK 🇬🇧 announced a new ["High Potential Individual" visa](https://www.lexology.com/library/detail.aspx?g=41fa64ec-9272-468c-bdcb-8002745a754f), which allows graduates of universities ranked in the Top 50 globally to move to the UK without a job lined up. This visa has been a subject of much debate, in part due to how much rankings play a role. (Rest assured, UCSD is on the list!)

In this section, you will analyze a dataset of university rankings, collected from  [here](https://www.kaggle.com/datasets/mylesoneill/world-university-rankings?datasetId=) (though we have pre-processed and modified the original dataset for the purposes of this question). Our version of the dataset is stored in `data/universities_unified.csv`.

Columns:
* `'world_rank'`: world rank of the institution
* `'institution'`: name of the institution
* `'national_rank'`: rank within the nation, formatted as `'country, rank'`
* `'quality_of_education'`: rank by quality of education
* `'alumni_employment'`: rank by alumni employment
* `'quality_of_faculty'`: rank by quality of faculty
* `'publications'`: rank by publications
* `'influence'`: rank by influence
* `'citations'`: rank by number of citations
* `'broad_impact'`: rank by broad impact
* `'patents'`: rank by number of patents
* `'score'`: overall score of the institution, out of 100
* `'control'`: whether the university is public or private
* `'city'`: city in which the institution is located
* `'state'`: state in which the institution is located

### Question 2 – Rankings 1️⃣

There are (still) a few aspects of the dataset we need to clean before it's ready for analysis.

#### `clean_universities`

Create a function `clean_universities` which takes in the raw rankings DataFrame and returns a cleaned DataFrame, cleaned according to the following information:

- Some `'institution'` names contain `'\n'` characters (e.g. `'University of California\nSan Diego'`). Replace all instances of `'\n'` with `', '` (a comma and a space) in the `'institution'` column.

- Change the data type of the `'broad_impact'` column to `int`.

* Split `'national_rank'` into two columns, `'nation'` and `'national_rank_cleaned'`, where:
    * `'nation'` is the country indicated in the first part of `'national_rank'`. 
        * Note that there are **3** countries that appear under different names for different schools. For all 3 of these countries, you should pick **the name that is longer** and use that name for every occurrence of the country. One of the 3 countries is **`'Czech Republic'`**, which also appears as **`'Czechia'`** – since these refer to the same country and `'Czech Republic'` is longer, all instances of either name should be replaced with `'Czech Republic'`. You need to find the other 2 countries on your own. 
        * As is mentioned below, your function will only be tested on the DataFrame in `data/universities_unified.csv`, so you don't need to worry about country names other than these 3.
    * `'national_rank_cleaned'` is the integer in the latter part of `'national_rank'`. Make sure that the data type of this column is `int`. 
    * Don't include the original `'national_rank'` column in the output DataFrame.
* Create a Boolean column `'is_r1_public'`. This column should contain `True` if a university is public and classified as R1 and `False` otherwise. Treat `np.NaN`s as False. **Note that in the raw DataFrame, a university is classified as R1 if and only if it has non-null values in all of the following columns: `'control'`, `'city'`, and `'state'`.**
    - Read [this page](https://en.wikipedia.org/wiki/List_of_research_universities_in_the_United_States) to learn more about what it means for a university to be classified as R1.
    
**The only dataset your function will be tested on is `data/universities_unified.csv`; you don't need to worry about other hidden test sets.** In addition, please return a *copy* of the original DataFrame; don't modify the original.

<br>

Now, we can do some basic exploration.

#### `university_info`

Create a function `university_info` that takes in the **cleaned** DataFrame outputted by `clean_universities` and returns the following values in a list:
* The `'nation'` with the highest average `'world_rank'`, among universities that have **at most** 500 `'citations'`. (**EDIT:** We mean universities with a `'citations'` rank that is less than or equal to 500.)
* The mean `'publications'` rank amongst universities that are at the bottom 200 in the dataset by `'quality_of_education'`.
* The `'state'` that has the highest proportion of public R1 universities, amongst all US states. (**EDIT:** There happen to be multiple states with the highest proportion; return any of them.)
* The lowest ranking `'institution'`, according to `'world_rank'`, that is ranked #1 in its nation (i.e. that has a `'national_rank_cleaned'` of 1).

You can assume there are no ties.

In [161]:
university_info(clean_universities(df))

['Singapore', 534.945, 'AL', 'University of Bucharest']

In [158]:
df_cleaned = clean_universities(df)
df_cleaned[df_cleaned['national_rank_cleaned'] == 1].sort_values(by = 'world_rank', ascending = False)['institution'].iloc[0]

'University of Bucharest'

In [144]:
df_cleaned = clean_universities(df)
#df_cleaned[df_cleaned['state'] == 'OR']
df_cleaned[df_cleaned['nation'] == 'United States'].groupby('state').mean().sort_values(by = 'is_r1_public', ascending = False).index[0]


'AL'

In [122]:
df_cleaned.sort_values(by = 'quality_of_education').index[df_cleaned.shape[0] - 200:df_cleaned.shape[0]]

Int64Index([302, 303, 304, 305, 306, 282, 277, 276, 272, 224,
            ...
            522, 523, 525, 526, 527, 528, 529, 531, 521, 999],
           dtype='int64', length=200)

In [126]:
df_cleaned = clean_universities(df)
df_cleaned.loc[df_cleaned.sort_values(by = 'quality_of_education').index[df_cleaned.shape[0] - 200:df_cleaned.shape[0]], 'publications'].mean()

534.945

In [102]:
df_cleaned = clean_universities(df)
df_cleaned[df_cleaned['citations'] <= 500].groupby('nation').mean().sort_values(by = 'world_rank').index[0]
#df_cleaned.dtypes

'Singapore'

In [109]:
df = pd.read_csv('/home/v/Documents/github_repos/dsc80-2022-sp/labs/03-hyp-dataframes/data/universities_unified.csv')
df[df['citations'] <= 500].groupby('nation').count()

KeyError: 'nation'

In [79]:
df = pd.read_csv('/home/v/Documents/github_repos/dsc80-2022-sp/labs/03-hyp-dataframes/data/universities_unified.csv')
df[:50]
df['institution'] = df['institution'].str.replace('\n',', ')
df['broad_impact'] = df['broad_impact'].astype(int)
df[:50]#.dtypes


splits = df['national_rank'].str.split(',')
splits
df[['nation', 'national_rank_cleaned']] = pd.DataFrame(splits.tolist())
df['national_rank_cleaned'] = df['national_rank_cleaned'].astype(int)
df['nation'] = df['nation'].apply(clean_3)
df = df.drop(columns = ['national_rank'])
df.dtypes

bool_df = pd.DataFrame(index = df.index)

bool_df['control_bool'] = df['control'].apply(convert_0_1)
bool_df['city_bool'] = df['city'].apply(convert_0_1)
bool_df['state_bool'] = df['state'].apply(convert_0_1)
bool_df[1:50]              
    #,df['city'].apply(convert_0_1),df['state'].apply(convert_0_1) ])
bool_df['public'] = df['control'].apply(public)
bool_df = bool_df.sum(axis = 1)
bool_df = bool_df == 4
df['is_r1_public'] = bool_df
df

Unnamed: 0,world_rank,institution,quality_of_education,alumni_employment,quality_of_faculty,publications,influence,citations,broad_impact,patents,score,control,city,state,nation,national_rank_cleaned,is_r1_public
0,1,Harvard University,1,1,1,1,1,1,1,3,100.00,Private (non-profit),Cambridge,MA,United States,1,False
1,2,Stanford University,9,2,4,5,3,3,4,10,98.66,Private (non-profit),Stanford,CA,United States,2,False
2,3,Massachusetts Institute of Technology,3,11,2,15,2,2,2,1,97.54,Private (non-profit),Cambridge,MA,United States,3,False
3,4,University of Cambridge,2,10,5,11,6,12,13,48,96.81,,,,United Kingdom,1,False
4,5,University of Oxford,7,13,10,7,12,7,9,15,96.46,,,,United Kingdom,2,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,996,University of the Algarve,367,567,218,926,845,812,969,816,44.03,,,,Portugal,7,False
996,997,Alexandria University,236,566,218,997,908,645,981,871,44.03,,,,Egypt,4,False
997,998,Federal University of Ceará,367,549,218,830,823,812,975,824,44.03,,,,Brazil,18,False
998,999,University of A Coruña,367,567,218,886,974,812,975,651,44.02,,,,Spain,40,False


In [66]:
df[1:50]

Unnamed: 0,world_rank,institution,quality_of_education,alumni_employment,quality_of_faculty,publications,influence,citations,broad_impact,patents,score,control,city,state,nation,national_rank_cleaned
1,2,Stanford University,9,2,4,5,3,3,4,10,98.66,Private (non-profit),Stanford,CA,United States,2
2,3,Massachusetts Institute of Technology,3,11,2,15,2,2,2,1,97.54,Private (non-profit),Cambridge,MA,United States,3
3,4,University of Cambridge,2,10,5,11,6,12,13,48,96.81,,,,United Kingdom,1
4,5,University of Oxford,7,13,10,7,12,7,9,15,96.46,,,,United Kingdom,2
5,6,Columbia University,13,6,9,13,13,11,12,4,96.14,Private (non-profit),New York,NY,United States,4
6,7,"University of California, Berkeley",5,21,6,10,4,4,7,29,92.25,Public,Berkeley,CA,United States,5
7,8,University of Chicago,11,14,8,17,16,12,22,141,90.7,Private (non-profit),Chicago,IL,United States,6
8,9,Princeton University,4,15,3,72,25,24,33,225,89.42,Private (non-profit),Princeton,NJ,United States,7
9,10,Cornell University,12,18,14,24,15,25,22,11,86.79,Private (non-profit),Ithaca,NY,United States,8
10,11,Yale University,10,26,11,18,8,35,20,49,86.61,Private (non-profit),New Haven,CT,United States,9


In [169]:
df

Unnamed: 0,world_rank,institution,national_rank,quality_of_education,alumni_employment,quality_of_faculty,publications,influence,citations,broad_impact,patents,score,control,city,state
0,1,Harvard University,"USA, 1",1,1,1,1,1,1,1.0,3,100.00,Private (non-profit),Cambridge,MA
1,2,Stanford University,"USA, 2",9,2,4,5,3,3,4.0,10,98.66,Private (non-profit),Stanford,CA
2,3,Massachusetts Institute of Technology,"USA, 3",3,11,2,15,2,2,2.0,1,97.54,Private (non-profit),Cambridge,MA
3,4,University of Cambridge,"UK, 1",2,10,5,11,6,12,13.0,48,96.81,,,
4,5,University of Oxford,"United Kingdom, 2",7,13,10,7,12,7,9.0,15,96.46,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,996,University of the Algarve,"Portugal, 7",367,567,218,926,845,812,969.0,816,44.03,,,
996,997,Alexandria University,"Egypt, 4",236,566,218,997,908,645,981.0,871,44.03,,,
997,998,Federal University of Ceará,"Brazil, 18",367,549,218,830,823,812,975.0,824,44.03,,,
998,999,University of A Coruña,"Spain, 40",367,567,218,886,974,812,975.0,651,44.02,,,


In [168]:
clean_universities(df)

Unnamed: 0,world_rank,institution,quality_of_education,alumni_employment,quality_of_faculty,publications,influence,citations,broad_impact,patents,score,control,city,state,nation,national_rank_cleaned,is_r1_public
0,1,Harvard University,1,1,1,1,1,1,1,3,100.00,Private (non-profit),Cambridge,MA,United States,1,False
1,2,Stanford University,9,2,4,5,3,3,4,10,98.66,Private (non-profit),Stanford,CA,United States,2,False
2,3,Massachusetts Institute of Technology,3,11,2,15,2,2,2,1,97.54,Private (non-profit),Cambridge,MA,United States,3,False
3,4,University of Cambridge,2,10,5,11,6,12,13,48,96.81,,,,United Kingdom,1,False
4,5,University of Oxford,7,13,10,7,12,7,9,15,96.46,,,,United Kingdom,2,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,996,University of the Algarve,367,567,218,926,845,812,969,816,44.03,,,,Portugal,7,False
996,997,Alexandria University,236,566,218,997,908,645,981,871,44.03,,,,Egypt,4,False
997,998,Federal University of Ceará,367,549,218,830,823,812,975,824,44.03,,,,Brazil,18,False
998,999,University of A Coruña,367,567,218,886,974,812,975,651,44.02,,,,Spain,40,False


In [72]:
def public(word):
    if (~pd.isnull(word)) & (word == 'Public'):
        return 1
    else:
        return 0

In [59]:
def convert_0_1(word):
    if pd.isnull(word):
        return 0
    else:
        return 1

In [43]:
def clean_3(word):
    if(word == 'USA'):
        return 'United States'
    elif word == 'UK':
        return 'United Kingdom'
    elif word == 'Czechia':
        return 'Czech Republic'
    else:
        return word

In [166]:
clean_universities(df).dtypes

world_rank                 int64
institution               object
quality_of_education       int64
alumni_employment          int64
quality_of_faculty         int64
publications               int64
influence                  int64
citations                  int64
broad_impact               int64
patents                    int64
score                    float64
control                   object
city                      object
state                     object
nation                    object
national_rank_cleaned      int64
is_r1_public                bool
dtype: object

In [198]:
# don't change this cell -- it is needed for the tests to work
fp = os.path.join('data', 'universities_unified.csv')
df = pd.read_csv(fp)
cleaned = clean_universities(df)
info = university_info(cleaned)

In [164]:
grader.check("q2")

### Question 3 – High Standards ™️ 

#### `std_scores_by_nation` 

Create a function `std_scores_by_nation` that takes in a **cleaned** DataFrame, like the one returned by `clean_universities`, and outputs a DataFrame: 
- with the same rows as the input, 
- with three columns: `'institution'`, `'nation'`, and `'score'` (in that order),
- where the `'score'` column is **standardized** by `'nation'` - that is, the `'score'`s for each country are converted to standard units, using the mean and standard deviation of the `'score'`s for that country. If a `'score'` is `np.NaN`, leave it as `np.NaN`.
    - For a review of standard units, see [Computational and Inferential Thinking](https://www.inferentialthinking.com/chapters/15/1/Correlation).
    - ***Hint:*** Use [`groupby` and `transform`](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.transform.html).

<br>

#### `su_and_spread`

Lastly, create a function `su_and_spread` that returns the answers to the following two questions, as a list.

****Part 1****

Let's compare rankings between two countries – the US 🇺🇸 and Canada 🇨🇦. There are in total $n$ universities in the US and $m$ universities in Canada. Suppose $x_1, x_2, ..., x_n$ are the `'world_rank'`s for US universities in **increasing order**, meaning that $x_1$ is the `'world_rank'` of the "best" US university. Similarly, $y_1, y_2, ..., y_m$ are the `'world_rank'`s for Canadian universities, also in increasing order. 

Suppose we take the aforementioned `'world_rank'`s and sort them together in **increasing order**, e.g. $x_1, x_2, y_1, x_3, ...$. **We define $R$ to be the average of the positions of the $x$ values.**

For example, if there are 3 US universities (so $n=3$) and 2 Canadian universities ($m=2$), and
  
$$x_1 = 1, x_2 = 3, x_3 = 10, \:\:\:\: y_1 = 5, y_2 = 15$$

When we sort the rankings in increasing order, we'd get 1, 3, 5, 10, 15, which correspond to the values $x_1, x_2, y_1, x_3, y_2$. The $x$ values are at positions 1, 2, and 4. Then, $R = \frac{1 + 2 + 4}{3} = \frac{7}{3}$. (Note that this is **not** the average of 1, 3, and 10).


**Question:** If we believe that US universities in general rank higher than Canadian universities, should $R$ be
1. larger than $\frac{m + n}{2}$?
2. smaller than $\frac{m + n}{2}$?
3. equal to $\frac{m + n}{2}$?


Store your answer – either 1, 2, or 3 – in the first element of `su_and_spread`'s output list. Note that this is a classical example of a non-parametric hypothesis test called a rank test.

<br>

****Part 2****

Which `'nation'` has the largest variation in `'score'`s before standardization? 

***Note:*** To find the answer to Part 2, you'll need to find the standard deviation of a column. You should use the formula with `n` in the denominator. `numpy`'s `.std()` by default uses that formula, while `pandas`' `.std()` by default uses the formula with `n-1` in the denominator. To force `pandas`' `.std()` to use `n` in the denominator, use the optional argument `ddof=0`.

Unnamed: 0,world_rank,institution,national_rank,quality_of_education,alumni_employment,quality_of_faculty,publications,influence,citations,broad_impact,patents,score,control,city,state
0,1,Harvard University,"USA, 1",1,1,1,1,1,1,1.0,3,100.00,Private (non-profit),Cambridge,MA
1,2,Stanford University,"USA, 2",9,2,4,5,3,3,4.0,10,98.66,Private (non-profit),Stanford,CA
2,3,Massachusetts Institute of Technology,"USA, 3",3,11,2,15,2,2,2.0,1,97.54,Private (non-profit),Cambridge,MA
3,4,University of Cambridge,"UK, 1",2,10,5,11,6,12,13.0,48,96.81,,,
4,5,University of Oxford,"United Kingdom, 2",7,13,10,7,12,7,9.0,15,96.46,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,996,University of the Algarve,"Portugal, 7",367,567,218,926,845,812,969.0,816,44.03,,,
996,997,Alexandria University,"Egypt, 4",236,566,218,997,908,645,981.0,871,44.03,,,
997,998,Federal University of Ceará,"Brazil, 18",367,549,218,830,823,812,975.0,824,44.03,,,
998,999,University of A Coruña,"Spain, 40",367,567,218,886,974,812,975.0,651,44.02,,,


In [192]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two'],
                   'C' : [1, 5, 5, 2, 5, 5],
                   'D' : [2.0, 5., 8., 1., 2., 9.]})

df

Unnamed: 0,A,B,C,D
0,foo,one,1,2.0
1,bar,one,5,5.0
2,foo,two,5,8.0
3,bar,three,2,1.0
4,foo,two,5,2.0
5,bar,two,5,9.0


In [195]:
grouped = df.groupby('A')
grouped.transform(lambda x: (x - x.mean()) / x.std())

  grouped.transform(lambda x: (x - x.mean()) / x.std())


Unnamed: 0,C,D
0,-1.154701,-0.57735
1,0.57735,0.0
2,0.57735,1.154701
3,-1.154701,-1.0
4,0.57735,-0.57735
5,0.57735,1.0


In [212]:
clean_universities(df)

Unnamed: 0,world_rank,institution,quality_of_education,alumni_employment,quality_of_faculty,publications,influence,citations,broad_impact,patents,score,control,city,state,nation,national_rank_cleaned,is_r1_public
0,1,Harvard University,1,1,1,1,1,1,1,3,100.00,Private (non-profit),Cambridge,MA,United States,1,False
1,2,Stanford University,9,2,4,5,3,3,4,10,98.66,Private (non-profit),Stanford,CA,United States,2,False
2,3,Massachusetts Institute of Technology,3,11,2,15,2,2,2,1,97.54,Private (non-profit),Cambridge,MA,United States,3,False
3,4,University of Cambridge,2,10,5,11,6,12,13,48,96.81,,,,United Kingdom,1,False
4,5,University of Oxford,7,13,10,7,12,7,9,15,96.46,,,,United Kingdom,2,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,996,University of the Algarve,367,567,218,926,845,812,969,816,44.03,,,,Portugal,7,False
996,997,Alexandria University,236,566,218,997,908,645,981,871,44.03,,,,Egypt,4,False
997,998,Federal University of Ceará,367,549,218,830,823,812,975,824,44.03,,,,Brazil,18,False
998,999,University of A Coruña,367,567,218,886,974,812,975,651,44.02,,,,Spain,40,False


In [242]:
df = pd.read_csv('/home/v/Documents/github_repos/dsc80-2022-sp/labs/03-hyp-dataframes/data/universities_unified.csv')

0      4.627265
1      4.502980
2      4.399100
3      5.210247
4      5.172883
         ...   
995   -1.059581
996   -0.624500
997   -0.481783
998   -0.765489
999   -0.540741
Name: score, Length: 1000, dtype: float64

In [239]:
std_scores_by_nation(clean_universities(df))

  >>> len(out) == 2


Unnamed: 0,institution,nation,score
0,Harvard University,United States,4.627265
1,Stanford University,United States,4.502980
2,Massachusetts Institute of Technology,United States,4.399100
3,University of Cambridge,United Kingdom,5.210247
4,University of Oxford,United Kingdom,5.172883
...,...,...,...
995,University of the Algarve,Portugal,-1.059581
996,Alexandria University,Egypt,-0.624500
997,Federal University of Ceará,Brazil,-0.481783
998,University of A Coruña,Spain,-0.765489


In [238]:
df_cleaned = clean_universities(df)
df_cleaned2 = pd.DataFrame(data = df_cleaned['institution'] , index = df_cleaned.index)
df_cleaned2['nation'] = df_cleaned['nation']
df_cleaned2['score'] = df_cleaned['score']

df_cleaned2 = df_cleaned2.groupby('nation')
df2 = df_cleaned2.transform(lambda x: (x - x.mean()) / x.std())


df2['score']
#df_cleaned2.set_index('nation')

#std = df_cleaned[df_cleaned['nation'] == 'China']['score'].std()
#mean = df_cleaned[df_cleaned['nation'] == 'China']['score'].mean()

#df
#(44.02 - mean) / std

df_cleaned3 = pd.DataFrame(data = df_cleaned['institution'] , index = df_cleaned.index)
df_cleaned3['nation'] = df_cleaned['nation']
df_cleaned3['score'] = df2['score']
df_cleaned3

  df2 = df_cleaned2.transform(lambda x: (x - x.mean()) / x.std())


Unnamed: 0,institution,nation,score
0,Harvard University,United States,4.627265
1,Stanford University,United States,4.502980
2,Massachusetts Institute of Technology,United States,4.399100
3,University of Cambridge,United Kingdom,5.210247
4,University of Oxford,United Kingdom,5.172883
...,...,...,...
995,University of the Algarve,Portugal,-1.059581
996,Alexandria University,Egypt,-0.624500
997,Federal University of Ceará,Brazil,-0.481783
998,University of A Coruña,Spain,-0.765489


In [254]:
df_cleaned[df_cleaned['nation'] == 'United States']

Unnamed: 0,world_rank,institution,quality_of_education,alumni_employment,quality_of_faculty,publications,influence,citations,broad_impact,patents,score,control,city,state,nation,national_rank_cleaned,is_r1_public
0,1,Harvard University,1,1,1,1,1,1,1,3,100.00,Private (non-profit),Cambridge,MA,United States,1,False
1,2,Stanford University,9,2,4,5,3,3,4,10,98.66,Private (non-profit),Stanford,CA,United States,2,False
2,3,Massachusetts Institute of Technology,3,11,2,15,2,2,2,1,97.54,Private (non-profit),Cambridge,MA,United States,3,False
5,6,Columbia University,13,6,9,13,13,11,12,4,96.14,Private (non-profit),New York,NY,United States,4,False
6,7,"University of California, Berkeley",5,21,6,10,4,4,7,29,92.25,Public,Berkeley,CA,United States,5,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
900,901,University of Southern Mississippi,367,567,218,913,853,812,850,676,44.13,Public,Hattiesburg,MS,United States,225,True
905,906,Oakland University,367,567,218,888,810,812,850,871,44.13,,,,United States,226,False
912,913,University of North Dakota,367,567,218,917,731,812,867,606,44.12,,,,United States,227,False
928,929,University of Texas at El Paso,367,442,218,910,838,812,906,706,44.10,Public,El Paso,TX,United States,228,True


In [271]:
df_cleaned = clean_universities(df)
df_cleaned2 = pd.DataFrame(data = df_cleaned['institution'] , index = df_cleaned.index)
df_cleaned2['nation'] = df_cleaned['nation']
df_cleaned2['score'] = df_cleaned['score']
df_cleaned2 = df_cleaned2.groupby('nation')

df2 = df_cleaned2.transform(lambda x: ((x - x.mean())**2).sum() / x.count())
df2.sort_values(by = 'score')
df_cleaned.iloc[0]

  df2 = df_cleaned2.transform(lambda x: ((x - x.mean())**2).sum() / x.count())


world_rank                                  1
institution                Harvard University
quality_of_education                        1
alumni_employment                           1
quality_of_faculty                          1
publications                                1
influence                                   1
citations                                   1
broad_impact                                1
patents                                     3
score                                   100.0
control                  Private (non-profit)
city                                Cambridge
state                                      MA
nation                          United States
national_rank_cleaned                       1
is_r1_public                            False
Name: 0, dtype: object

In [262]:
mean = df_cleaned[df_cleaned['nation'] == 'United Kingdom']['score'].mean()
((df_cleaned[df_cleaned['nation'] == 'United Kingdom']['score'] - mean) ** 2).sum() / df_cleaned[df_cleaned['nation'] == 'United Kingdom']['score'].shape[0]

86.39436023668638

In [250]:
df_cleaned = clean_universities(df)
df_cleaned.groupby('nation').

AttributeError: 'DataFrameGroupBy' object has no attribute 'values'

In [None]:
def

In [276]:
# do not edit this cell -- it is needed for the tests
fp = os.path.join('data', 'universities_unified.csv')
universities = pd.read_csv(fp)
cleaned = clean_universities(universities)
universities_out = std_scores_by_nation(cleaned)
su_and_spread_out = su_and_spread()

  df2 = df_cleaned2.transform(lambda x: (x - x.mean()) / x.std())


In [277]:
grader.check("q3")

## Part 3: Combining Data

### Question 4 – Making Connections 🤝

A group of students decided to send out a survey to their connections on LinkedIn. Each student asks 1000 of their connections for their first and last name, the company they currently work at, their job title, their email, and the university they attended.

**Your job is to combine all the data contained in the files `survey*.csv` (stored within the `data/responses` folder) into a single DataFrame. The number of files and the number of rows in each file may vary, so don't hardcode your answers!** To do so, implement the following two functions.

#### `read_linkedin_survey`

Create a function `read_linkedin_survey` which takes in a string describing the path to a folder containing `survey*.csv` files and outputs a DataFrame with six columns titled `'first name'`, `'last name'`, `'current company'`, `'job title'`, `'email'`, and `'university'` (in that order) containing the survey information for all files combined. Make sure to reset the index of the combined DataFrame before returning it so that the index is unique. 

***Hints***:

- Take a look at a few of the files in the `responses` folder. You may have to do some data cleaning to combine the DataFrames!

- You can list the files in a directory using `os.listdir`.

<br>

#### `com_stats`

Create a function `com_stats` which takes in a DataFrame returned by `read_linkedin_survey` and returns a list containing, in the following order: 
- The number of employees at the company that hired the most employees
- The number of email addresses that **end in** `'.edu'`
- The job title that has the longest name (there are no ties)
- The number of managers (a manager is anyone who has the word `'manager'` in their job title, uppercase or lowercase)

In [449]:
type(com_stats(read_linkedin_survey(dirname))[0])

numpy.int64

In [443]:
dupe = big['job title']
dupe = dupe.astype(str).apply(str.lower)
dupe[dupe.str.contains('manager')].shape[0]

369

In [415]:
def get_length(word):
    return len(str(word))

In [433]:
idx = big['job title'].astype(str).apply(len).sort_values(ascending = False).index[0]
big.loc[idx, 'job title']
#big.iloc[2475]

'Business Systems Development Analyst'

In [405]:
(big['email'].str.endswith('.edu') == True).sum()

253

In [None]:
big[big['email'].]

In [432]:
big = read_linkedin_survey(dirname)
big['current company'].value_counts().sort_values(ascending = False)[0]

5

In [None]:
current company

In [618]:
dirname = os.path.join('data', 'responses')
file_list = os.listdir(dirname)
file_list



['survey2.csv', 'survey3.csv', 'survey1.csv', 'survey4.csv', 'survey5.csv']

In [298]:
dfs = np.array([])
for i in surveys:
    dfs = np.append(dfs, pd.read_csv(os.getcwd() + '/data/responses/' + i))
dfs[0]

'Harvey Inc'

In [376]:
x = 10
y = x
x = 11
y
pd.set_option('display.max_rows', None)

In [619]:
read_linkedin_survey(dirname)

Unnamed: 0,first name,last name,current company,job title,email,university
0,Ardelia,Winspurr,Harvey Inc,Safety Technician IV,awinspurr0@timesonline.co.uk,Universidad Autónoma de Yucatán
1,Ileane,Balhatchet,Johnston-Hermann,Structural Engineer,ibalhatchet1@fastcompany.com,Technical University of Opole
2,Damita,Seamer,Dibbert-Lemke,Human Resources Assistant III,dseamer2@elegantthemes.com,Osaka City University
3,Krystal,Clerc,"Rutherford, Schiller and Skiles",Staff Accountant III,kclerc3@lulu.com,"DeVry Institute of Technology, Decatur"
4,Kirsti,Raithbie,"Luettgen, Anderson and Green",Automation Specialist III,kraithbie4@liveinternet.ru,University of Maryland Medicine
5,Ingrid,Louis,"Goldner, Skiles and Huels",Project Manager,ilouis5@slashdot.org,Nuclear Institute for Agriculture and Biology ...
6,Lesli,Murdoch,Nader LLC,Occupational Therapist,lmurdoch6@canalblog.com,Mahidol University
7,Xavier,Risbie,"Wilderman, Stanton and Orn",Graphic Designer,xrisbie7@nifty.com,Ranchi University
8,Bernadene,Lisamore,Cruickshank Group,Recruiting Manager,blisamore8@phpbb.com,Joetsu University of Education
9,Filippa,Gerren,"Schneider, Mante and O'Kon",Senior Quality Engineer,fgerren9@google.co.uk,Sadat Institute of Higher Education


In [388]:
list_of_dfs = [pd.read_csv(os.path.join(dirname, file)) for file in file_list]
list_of_dfs[1]

df1 = list_of_dfs[0]
df1

for i in list_of_dfs:
    old_cols = i.columns
    cols = []
    for j in old_cols: 
        cols.append(j.lower().replace('_', ' '))
        
        #print(j)
    print(cols)
    i.columns = cols
#list_of_dfs[2]
list_of_dfs[4]
big_list = pd.concat(list_of_dfs)
big_list = big_list.reset_index().drop(columns = 'index')

sorted_cols_df = pd.DataFrame(data = big_list['first name'], index = big_list.index)
sorted_cols_df['last name'] = big_list['last name']
sorted_cols_df['current company'] = big_list['current company']
sorted_cols_df['job title'] = big_list['job title']
sorted_cols_df['email'] = big_list['email']
sorted_cols_df['university'] = big_list['university']
sorted_cols_df

['current company', 'job title', 'first name', 'last name', 'email', 'university']
['current company', 'email', 'first name', 'last name', 'job title', 'university']
['first name', 'last name', 'job title', 'email', 'current company', 'university']
['current company', 'email', 'first name', 'last name', 'job title', 'university']
['email', 'first name', 'last name', 'job title', 'university', 'current company']


Unnamed: 0,first name,last name,current company,job title,email,university
0,Ardelia,Winspurr,Harvey Inc,Safety Technician IV,awinspurr0@timesonline.co.uk,Universidad Autónoma de Yucatán
1,Ileane,Balhatchet,Johnston-Hermann,Structural Engineer,ibalhatchet1@fastcompany.com,Technical University of Opole
2,Damita,Seamer,Dibbert-Lemke,Human Resources Assistant III,dseamer2@elegantthemes.com,Osaka City University
3,Krystal,Clerc,"Rutherford, Schiller and Skiles",Staff Accountant III,kclerc3@lulu.com,"DeVry Institute of Technology, Decatur"
4,Kirsti,Raithbie,"Luettgen, Anderson and Green",Automation Specialist III,kraithbie4@liveinternet.ru,University of Maryland Medicine
5,Ingrid,Louis,"Goldner, Skiles and Huels",Project Manager,ilouis5@slashdot.org,Nuclear Institute for Agriculture and Biology ...
6,Lesli,Murdoch,Nader LLC,Occupational Therapist,lmurdoch6@canalblog.com,Mahidol University
7,Xavier,Risbie,"Wilderman, Stanton and Orn",Graphic Designer,xrisbie7@nifty.com,Ranchi University
8,Bernadene,Lisamore,Cruickshank Group,Recruiting Manager,blisamore8@phpbb.com,Joetsu University of Education
9,Filippa,Gerren,"Schneider, Mante and O'Kon",Senior Quality Engineer,fgerren9@google.co.uk,Sadat Institute of Higher Education


In [451]:
# do not edit this cell -- it is needed for the tests
dirname = os.path.join('data', 'responses')
out = read_linkedin_survey(dirname)
stats_out = com_stats(out)

In [452]:
grader.check("q4")

### Question 5 – Survey Says... 👨‍👩‍👧‍👦

Professor Billy often sends out extra credit surveys asking students for their favorite animals, movies, and other favorite things. These surveys are stored in the `data/extra-credit-surveys` folder. Each file in that folder corresponds to a different survey question (except for `favorite1.csv`, which contains students' names and IDs).

Here's how extra credit works:
- Each student who has completed at least 75% of the survey questions receives 5 points of extra credit.
- If there is at least one survey question that at least 90% of the class answered (e.g. favorite animal), **everyone** in the class receives 1 point of extra credit. This overall class extra credit only applies once, so if for example 95% of students answer the favorite color survey question and 91% answer the favorite animal survey question, the entire class still only receives 1 extra point as a class).
- Note that this means that the most extra credit any student can earn is 6 points.

#### `read_student_surveys`

Create a function `read_student_surveys` which takes in a string describing the path to a folder containing `favorite*.csv` files and outputs a DataFrame containing all of the survey data combined, indexed by student ID (a value 1-1000).

<br>

#### `check_credit`

Create a function `check_credit` which takes in a DataFrame returned by `read_student_surveys` and outputs a DataFrame indexed by student ID (a value 1-1000) with two columns:
- `'name'`, containing the name of each student, and
- `'ec'`, containing the number of extra credit points each student earned.

In [None]:
np.clip()

In [501]:
read_student_surveys(dirname)

Unnamed: 0_level_0,plant,color,name,movie,genre,animal
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,,Red,Myrtia,,(no genres listed),Long-crested hawk eagle
2,,Khaki,Nathanil,,Documentary,Euro wallaby
3,,Red,Joni,"Glass-blower's Children, The (Glasblåsarns barn)",,Brown brocket
4,,Yellow,Prentice,,(no genres listed),"Peccary, white-lipped"
5,,Fuscia,Claudette,,,"Capuchin, brown"
6,,Maroon,Obed,,Drama,"Parrot, hawk-headed"
7,,Puce,Bryna,Private Detective 62,Drama|Romance,"Weaver, sociable"
8,,Khaki,Cati,,Drama|Romance,
9,,Green,Marilyn,,Drama|Mystery|Thriller,
10,,Green,Anni,,Drama|Sci-Fi|Thriller,


In [480]:
names = big_df['name']
#names
big_df
big_df = big_df.drop(columns = ['name']).applymap(apply_nan)


In [503]:
big_df.shape

(1000, 5)

In [506]:
check_credit(read_student_surveys(dirname))

Unnamed: 0_level_0,name,ec
id,Unnamed: 1_level_1,Unnamed: 2_level_1
1,Myrtia,1
2,Nathanil,1
3,Joni,1
4,Prentice,1
5,Claudette,1
6,Obed,1
7,Bryna,6
8,Cati,1
9,Marilyn,1
10,Anni,1


In [505]:
big_df

class_extra = big_df.sum(axis = 0)
class_extra = class_extra / big_df.shape[0]
class_extra = class_extra[class_extra >= 0.90].shape[0]

if(class_extra >= 1):
    class_extra = 1
else:
    class_extra = 0

indiv = big_df.sum(axis = 1)
indiv = indiv / len(big_df.columns)

totals = ((indiv >= .75) * 5) + class_extra
totals
#(indiv >= .75)

final = pd.DataFrame(data = names, index = big_df.index)
final['ec'] = totals
final

Unnamed: 0_level_0,name,ec
id,Unnamed: 1_level_1,Unnamed: 2_level_1
1,Myrtia,1
2,Nathanil,1
3,Joni,1
4,Prentice,1
5,Claudette,1
6,Obed,1
7,Bryna,6
8,Cati,1
9,Marilyn,1
10,Anni,1


In [494]:
False * 5

0

In [462]:
def apply_nan(word):
    if pd.isnull(word):
        return 0
    else:
        return 1

In [454]:
dirname = os.path.join('data', 'extra-credit-surveys')
file_list = os.listdir(dirname)
file_list

['favorite5.csv',
 'favorite6.csv',
 'favorite1.csv',
 'favorite2.csv',
 'favorite3.csv',
 'favorite4.csv']

In [510]:
list_of_dfs = [pd.read_csv(os.path.join(dirname, file)) for file in file_list]
#list_of_dfs[0]

big_df = list_of_dfs[0]
for i in np.arange(1,len(list_of_dfs)):
    big_df = big_df.merge(list_of_dfs[i], on = 'id' )
big_df = big_df.set_index('id')

In [461]:
read_student_surveys(dirname)

Unnamed: 0_level_0,plant,color,name,movie,genre,animal
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,,Red,Myrtia,,(no genres listed),Long-crested hawk eagle
2,,Khaki,Nathanil,,Documentary,Euro wallaby
3,,Red,Joni,"Glass-blower's Children, The (Glasblåsarns barn)",,Brown brocket
4,,Yellow,Prentice,,(no genres listed),"Peccary, white-lipped"
5,,Fuscia,Claudette,,,"Capuchin, brown"
6,,Maroon,Obed,,Drama,"Parrot, hawk-headed"
7,,Puce,Bryna,Private Detective 62,Drama|Romance,"Weaver, sociable"
8,,Khaki,Cati,,Drama|Romance,
9,,Green,Marilyn,,Drama|Mystery|Thriller,
10,,Green,Anni,,Drama|Sci-Fi|Thriller,


In [511]:
# do not edit this cell -- it is needed for the tests
dirname = os.path.join('data', 'extra-credit-surveys')
out = read_student_surveys(dirname)
check_credit_out = check_credit(out)

In [512]:
grader.check("q5")

### Question 6 – Paw Patrol 🐾

You are analyzing data from a veterinarian clinic. The datasets contain several types of information from the clinic, including its customers (pet owners), pets, available procedures, and procedure history. The column names are self-explanatory. These DataFrames are provided to you:
-  `owners` stores the customer information, where every `'OwnerID'` is unique (verify this yourself).
-  `pets` stores the pet information. Each pet belongs to a customer in `owners`.
-  `procedure_detail` contains a catalog of procedures that are offered by the clinic.
-  `procedure_history` has procedure records. Each procedure was given to a pet in `pets`.

<br>

Implement the following three functions, which each ask you to answer a specific question.

#### `most_popular_procedure`

What is the most popular `'ProcedureType'` amongst all pets in the `pets` DataFrame? Create a function `most_popular_procedure` that takes in two DataFrames, `pets` and `procedure_history`, and returns the name of the most popular `'ProcedureType'` as a string.

Note that some pets are registered but haven't had any procedures performed. Also, some pets that have had procedures done are not registered in `pets`.


<br>

#### `pet_name_by_owner`

What is the name of each customer's pet(s)? Create a function `pet_name_by_owner` that takes in two DataFrames, `owners` and `pets`, and returns a Series whose index contains owner first names, and whose values are pet names as **strings**. If an owner has multiple pets, the value corresponding to that owner should instead be a **list of pet names as strings**.

Note that owner first names are not necessarily unique, and so the Series you return will not necessarily have a unique index.

<br>

#### `total_cost_per_city`

Note that the `owners` DataFrame has a `'City'` column, describing the city in which each pet owner and their pets live. How much did each city spend in total on procedures? Create a function `total_cost_per_city` that takes in four DataFrames, `owners`, `pets`, `procedure_history`, and `procedure_detail`, and returns a Series indexed by `'City'` that describes the total amount that each city has spent on pets' procedures.

***Hint:*** At some point, you may have to merge on multiple columns.

In [599]:
owner_fp = os.path.join('data', 'pets', 'Owners.csv')
pets_fp = os.path.join('data', 'pets', 'Pets.csv')
history_fp = os.path.join('data', 'pets', 'ProceduresHistory.csv')
details_fp = os.path.join('data', 'pets', 'ProceduresDetails.csv')

pets = pd.read_csv(pets_fp)
owners = pd.read_csv(owner_fp)
history = pd.read_csv(history_fp)
details = pd.read_csv(details_fp)

In [608]:
history_wdetails

Unnamed: 0,PetID,Date,ProcedureType,ProcedureSubCode,Description,Price
0,A8-1181,2016-01-10,VACCINATIONS,5,Rabies,10
1,E7-3766,2016-01-11,VACCINATIONS,5,Rabies,10
2,B8-8740,2016-01-11,VACCINATIONS,5,Rabies,10
3,D4-9443,2016-01-11,VACCINATIONS,5,Rabies,10
4,E2-6642,2016-01-12,VACCINATIONS,5,Rabies,10
5,A4-1165,2016-01-12,VACCINATIONS,5,Rabies,10
6,F9-9345,2016-01-12,VACCINATIONS,5,Rabies,10
7,F3-9375,2016-01-12,VACCINATIONS,5,Rabies,10
8,D2-8905,2016-01-13,VACCINATIONS,5,Rabies,10
9,E8-8379,2016-01-13,VACCINATIONS,5,Rabies,10


In [609]:
pets_wdetails

Unnamed: 0,PetID,Name,Kind,Gender,Age,OwnerID,Date,ProcedureType,ProcedureSubCode,Description,Price
0,J6-8562,Blackie,Dog,male,11,5168,2016-08-21,GENERAL SURGERIES,8,Umbilical,175
1,M0-2904,Simba,Cat,male,1,3086,2016-07-22,VACCINATIONS,5,Rabies,10
2,P2-7342,Cuddles,Dog,male,13,4378,2016-10-05,VACCINATIONS,5,Rabies,10
3,X0-8765,Vuitton,Parrot,female,11,7581,2016-03-18,VACCINATIONS,5,Rabies,10
4,X0-8765,Vuitton,Parrot,female,11,7581,2016-10-03,GENERAL SURGERIES,8,Umbilical,175
5,M8-7852,Cookie,Cat,female,8,7606,2016-09-19,VACCINATIONS,5,Rabies,10
6,U4-9376,Scout,Dog,female,2,7846,2016-12-10,VACCINATIONS,5,Rabies,10
7,F6-5391,Cookie,Cat,female,9,5508,2016-12-21,GROOMING,1,Bath,15
8,T0-5705,Biscuit,Dog,female,5,5833,2016-11-04,VACCINATIONS,5,Rabies,10
9,P0-1725,Lily,Dog,female,0,2419,2016-07-06,GROOMING,3,Flea Spray,10


In [None]:
total_cost_per_city(owners, pets, )

In [612]:
history_wdetails = history.merge(details, on = ['ProcedureType','ProcedureSubCode'])
history_wdetails

pets_wdetails = pets.merge(history_wdetails, on = 'PetID', how = 'inner')
pets_wdetails


with_cities = pets_wdetails.merge(owners, on = 'OwnerID')
with_cities
#with_cities.groupby('City').sum()['Price']

Unnamed: 0,PetID,Name_x,Kind,Gender,Age,OwnerID,Date,ProcedureType,ProcedureSubCode,Description,Price,Name_y,Surname,StreetAddress,City,State,StateFull,ZipCode
0,J6-8562,Blackie,Dog,male,11,5168,2016-08-21,GENERAL SURGERIES,8,Umbilical,175,Robert,Foster,4680 Rubaiyat Road,Grand Rapids,MI,Michigan,49503
1,M0-2904,Simba,Cat,male,1,3086,2016-07-22,VACCINATIONS,5,Rabies,10,Ed,Enriquez,3413 Reppert Coal Road,Warren,MI,Michigan,48093
2,P2-7342,Cuddles,Dog,male,13,4378,2016-10-05,VACCINATIONS,5,Rabies,10,George,McDonald,4715 Wood Duck Drive,Marquette,MI,Michigan,49855
3,X0-8765,Vuitton,Parrot,female,11,7581,2016-03-18,VACCINATIONS,5,Rabies,10,Florence,Nolen,3103 Howard Street,Grand Rapids,MI,Michigan,49503
4,X0-8765,Vuitton,Parrot,female,11,7581,2016-10-03,GENERAL SURGERIES,8,Umbilical,175,Florence,Nolen,3103 Howard Street,Grand Rapids,MI,Michigan,49503
5,M8-7852,Cookie,Cat,female,8,7606,2016-09-19,VACCINATIONS,5,Rabies,10,Edna,Moreno,2548 Wetzel Lane,Grand Rapids,MI,Michigan,49503
6,U4-9376,Scout,Dog,female,2,7846,2016-12-10,VACCINATIONS,5,Rabies,10,Elvia,Warren,2041 Eagle Drive,Southfield,MI,Michigan,48075
7,F6-5391,Cookie,Cat,female,9,5508,2016-12-21,GROOMING,1,Bath,15,Charles,Swarey,2463 Charles Street,Flint,MI,Michigan,48548
8,T0-5705,Biscuit,Dog,female,5,5833,2016-11-04,VACCINATIONS,5,Rabies,10,Mary,Hurtado,4865 Juniper Drive,Saint Charles,MI,Michigan,48655
9,P0-1725,Lily,Dog,female,0,2419,2016-07-06,GROOMING,3,Flea Spray,10,Luisa,Cuellar,1308 Shingleton Road,Kalamazoo,MI,Michigan,49007


In [555]:
owner_fp = os.path.join('data', 'pets', 'Owners.csv')
pets_fp = os.path.join('data', 'pets', 'Pets.csv')


pets = pd.read_csv(pets_fp)
owners = pd.read_csv(owner_fp)

In [613]:
total_cost_per_city(owners,pets,history, details)

City
Ann Arbor            450
Center Line           10
Commerce              10
Detroit              305
East Lansing          40
Farmington Hills      10
Flint                 15
Grand Rapids        1240
Kalamazoo             10
Lansing               30
Livonia               10
Marquette             50
Michigan Center       10
Plymouth              10
Pontiac               30
Roseville             10
Saint Charles         10
Southfield            65
Warren                10
Wayne                 10
Name: Price, dtype: int64

In [568]:
def split_names(word):
    if word.find(',') != -1:
        return word.split(';')
    else:
        return word

In [596]:
names = owners.merge(pets, on = 'OwnerID', how = 'left', suffixes = ('_owner','_pet'))
owner_names = pd.DataFrame([owners['Name'], owners['OwnerID']]).T
names = names.groupby('OwnerID')['Name_pet'].apply(lambda x : ",".join(x)).reset_index()
#names = names.merge(owner_names, on = )
names['Name_pet'] = names['Name_pet'].apply(split_names)
owner_names
names = names.merge(owner_names, on = 'OwnerID', how = 'left').drop(columns = 'OwnerID').set_index('Name')
#names['Name_pet']
names['Name_pet']
#pd.DataFrame(names)

Name
Jessica                         Biscuit
Rosa                              Stowe
Susan                              Enyo
Benjamin              [Danger,Collette]
Charles                           Rumba
Joe                          Heisenberg
Jason                          Crockett
Joseph                          Blackie
Carolyn                          Cookie
Doris                             Scout
Jeffrey                          Bandit
Christopher                       Rumba
William                          Goethe
Robert                              Taz
Luisa                              Lily
Wm                                Simba
John                              Kashi
Anne                            Natacha
Bruce                             Bruce
John                            Biscuit
Travis                          Houdini
Paul                              Tiger
Ed                                Simba
Lee                 [Bright,Angel,Jake]
Susan                             D

In [537]:
owners

Unnamed: 0,OwnerID,Name,Surname,StreetAddress,City,State,StateFull,ZipCode
0,6049,Debbie,Metivier,315 Goff Avenue,Grand Rapids,MI,Michigan,49503
1,2863,John,Sebastian,3221 Perry Street,Davison,MI,Michigan,48423
2,3518,Connie,Pauley,1539 Cunningham Court,Bloomfield Township,MI,Michigan,48302
3,3663,Lena,Haliburton,4217 Twin Oaks Drive,Traverse City,MI,Michigan,49684
4,1070,Jessica,Velazquez,3861 Woodbridge Lane,Southfield,MI,Michigan,48034
5,7101,Bessie,Yen,30 Cunningham Court,Rochester Hills,MI,Michigan,48306
6,2419,Luisa,Cuellar,1308 Shingleton Road,Kalamazoo,MI,Michigan,49007
7,6194,Karen,Torres,3941 Ritter Avenue,Center Line,MI,Michigan,48015
8,5833,Mary,Hurtado,4865 Juniper Drive,Saint Charles,MI,Michigan,48655
9,9614,Carmen,Ingram,1056 Eagle Drive,Detroit,MI,Michigan,48219


In [538]:
pets

Unnamed: 0,PetID,Name,Kind,Gender,Age,OwnerID
0,J6-8562,Blackie,Dog,male,11,5168
1,Q0-2001,Roomba,Cat,male,9,5508
2,M0-2904,Simba,Cat,male,1,3086
3,R3-7551,Keller,Parrot,female,2,7908
4,P2-7342,Cuddles,Dog,male,13,4378
5,X0-8765,Vuitton,Parrot,female,11,7581
6,Z4-5652,Priya,Cat,female,7,7343
7,Z4-4045,Simba,Cat,male,0,2700
8,M8-7852,Cookie,Cat,female,8,7606
9,J2-3320,Heisenberg,Dog,male,3,1319


In [514]:
pets_fp = os.path.join('data', 'pets', 'Pets.csv')
procedure_history_fp = os.path.join('data', 'pets', 'ProceduresHistory.csv')

pets = pd.read_csv(pets_fp)
procedure_history = pd.read_csv(procedure_history_fp)
pets

Unnamed: 0,PetID,Name,Kind,Gender,Age,OwnerID
0,J6-8562,Blackie,Dog,male,11,5168
1,Q0-2001,Roomba,Cat,male,9,5508
2,M0-2904,Simba,Cat,male,1,3086
3,R3-7551,Keller,Parrot,female,2,7908
4,P2-7342,Cuddles,Dog,male,13,4378
5,X0-8765,Vuitton,Parrot,female,11,7581
6,Z4-5652,Priya,Cat,female,7,7343
7,Z4-4045,Simba,Cat,male,0,2700
8,M8-7852,Cookie,Cat,female,8,7606
9,J2-3320,Heisenberg,Dog,male,3,1319


In [515]:
procedure_history

Unnamed: 0,PetID,Date,ProcedureType,ProcedureSubCode
0,A8-1181,2016-01-10,VACCINATIONS,5
1,E7-3766,2016-01-11,VACCINATIONS,5
2,B8-8740,2016-01-11,VACCINATIONS,5
3,D4-9443,2016-01-11,VACCINATIONS,5
4,F6-3398,2016-01-12,HOSPITALIZATION,1
5,E2-6642,2016-01-12,VACCINATIONS,5
6,A4-1165,2016-01-12,VACCINATIONS,5
7,F9-9345,2016-01-12,VACCINATIONS,5
8,F3-9375,2016-01-12,VACCINATIONS,5
9,F9-5311,2016-01-12,GENERAL SURGERIES,6


In [None]:
most_po

In [534]:
pets_and_hist = pets.merge(procedure_history, on = 'PetID', how = 'left')
pets_and_hist['ProcedureType'].value_counts().sort_values(ascending = False).index[0]

'VACCINATIONS'

In [535]:
most_popular_procedure(pets,procedure_history)

'VACCINATIONS'

In [None]:
# do not edit this cell -- it is needed for the tests
pets_fp = os.path.join('data', 'pets', 'Pets.csv')
procedure_history_fp = os.path.join('data', 'pets', 'ProceduresHistory.csv')
owners_fp = os.path.join('data', 'pets', 'Owners.csv')
procedure_detail_fp = os.path.join('data', 'pets', 'ProceduresDetails.csv')
pets = pd.read_csv(pets_fp)
procedure_history = pd.read_csv(procedure_history_fp)
owners = pd.read_csv(owners_fp)
procedure_detail = pd.read_csv(procedure_detail_fp)

out_01 = most_popular_procedure(pets, procedure_history)
out_02 = pet_name_by_owner(owners, pets)
out_03 = total_cost_per_city(owners, pets, procedure_history, procedure_detail)

In [614]:
grader.check("q6")

## Congratulations! You're done! 🏁

Submit your `.py` file to Gradescope. Note that you only need to submit the `.py` file; this notebook should not be uploaded.

Before submitting, you should ensure that all of your work is in the `.py` file. You can do this by running the doctests below, which will verify that your work passes the public tests **and** that your work is in the `.py` file. Run the cell below; you should see no output.

In [615]:
!python -m doctest lab.py

  df2 = df_cleaned2.transform(lambda x: (x - x.mean()) / x.std())


In addition, `grader.check_all()` will verify that your work passes the public tests. Ultimately, the Gradescope autograder is also going to run `grader.check_all()`, so you should ensure these pass as well (which they should if the doctests above passed).

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [616]:
grader.check_all()

q1 results: All test cases passed!

q2 results: All test cases passed!

q3 results: All test cases passed!

q4 results:
    q4 - 1 result:
        Test case passed!

    q4 - 2 result:
        Trying:
            len(out) == 5000
        Expecting:
            True
        **********************************************************************
        Line 1, in q4 1
        Failed example:
            len(out) == 5000
        Expected:
            True
        Got:
            False

    q4 - 3 result:
        Test case passed!

    q4 - 4 result:
        Test case passed!

    q4 - 5 result:
        Test case passed!

    q4 - 6 result:
        Test case passed!

q5 results: All test cases passed!

q6 results:
    q6 - 1 result:
        Trying:
            isinstance(out_01, str)
        Expecting:
            True
        **********************************************************************
        Line 1, in q6 0
        Failed example:
            isinstance(out_01, str)
        E