## Assignment 1 Data Analysis using Pandas

This assignment contains 13 questions, detailed below. The due date is Sunday, November 10th, at 23:59. Each late day results in a loss of 20% of the total points.

The 'Daily reports (csse_covid_19_daily_reports)' file contains the daily case report for 01-01-2023 (MM-DD-YYYY). All timestamps are in UTC (GMT+0). A fuller description can be found in the [COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19).

References:

- Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Inf Dis. 20(5):533-534. doi: 10.1016/S1473-3099(20)30120-1


The fields (features/columns) are described as follows:

- FIPS: US only. Federal Information Processing Standards code that uniquely identifies counties within the USA.

- Admin2: County name. US only.

- Province_State: Province, state or dependency name.

- Country_Region: Country, region or sovereignty name. The names of locations included on the Website correspond with the official designations used by the U.S. Department of State.

- Last Update: MM/DD/YYYY HH:mm:ss (24 hour format, in UTC).

- Lat and Long: Dot locations on the dashboard. All points (except for Australia) shown on the map are based on geographic centroids, and are not representative of a specific address, building or any location at a spatial scale finer than a province/state. Australian dots are located at the centroid of the largest city in each state.

- Confirmed: Counts include confirmed and probable (where reported).

- Deaths: Counts include confirmed and probable (where reported).

- Recovered: Recovered cases are estimates based on local media reports, and state and local reporting when available, and therefore may be substantially lower than the true number. US state-level recovered cases are from the COVID Tracking Project. Recovered cases are no longer maintained.

- Active: Active cases = total cases - total recovered - total deaths. This value is for reference only, since recovered cases are no longer reported.

- Incident_Rate: Incidence Rate = cases per 100,000 persons.

- Case_Fatality_Ratio (%): Case-Fatality Ratio (%) = 100 * Number recorded deaths / Number cases.

- All cases, deaths, and recoveries reported are based on the date of initial report.
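
As a quick check of the Case-Fatality Ratio definition above, here is a minimal sketch using the Autauga, Alabama figures that appear later in this notebook (18961 confirmed, 230 deaths). The helper name `case_fatality_ratio` is hypothetical and only serves the illustration.

```python
def case_fatality_ratio(deaths: int, confirmed: int) -> float:
    """Case-Fatality Ratio (%) = 100 * recorded deaths / cases."""
    return 100 * deaths / confirmed


# Autauga, Alabama, US: 230 deaths out of 18961 confirmed cases
cfr = case_fatality_ratio(230, 18961)
print(round(cfr, 4))
```

The result should agree, to four decimal places, with the `Case_Fatality_Ratio` value listed for that county in the tables further down.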


Note: Please download the dataset "01-01-2023.csv" from Moodle to your local path for the analysis, because the original data has been modified to suit the needs of this assignment.

In [90]:
import pandas as pd
import numpy as np

**Question 1 (2 points)**

Use ```pandas``` to read the downloaded file from your local path. Print the column names, and print a general description of the data using the ```.describe()``` function.

In [91]:
### Q1
df = pd.read_csv('data_01-01-2023.csv')
df = df.iloc[:, 1:]

df.head(100)


Unnamed: 0,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio
0,,,Afghanistan,2023-01-02 04:20:57,33.93911,67.709953,207616,7849,,,Afghanistan,533.328662,3.780537
1,,,Albania,2023-01-02 04:20:57,41.15330,20.168300,333811,3595,,,Albania,11599.520467,1.076957
2,,,Algeria,2023-01-02 04:20:57,28.03390,1.659600,271229,6881,,,Algeria,618.523486,2.536971
3,,,Andorra,2023-01-02 04:20:57,42.50630,1.521800,47751,165,,,Andorra,61801.591924,0.345543
4,,,Angola,2023-01-02 04:20:57,-11.20270,17.873900,105095,1930,,,Angola,319.765542,1.836434
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,,Antofagasta,Chile,2023-01-02 04:20:57,-23.65090,-70.397500,178887,1924,,,"Antofagasta, Chile",29444.771815,1.075539
96,,Araucania,Chile,2023-01-02 04:20:57,-38.94890,-72.331100,286388,2851,,,"Araucania, Chile",29918.597946,0.995503
97,,Arica y Parinacota,Chile,2023-01-02 04:20:57,-18.59400,-69.478500,78369,881,,,"Arica y Parinacota, Chile",34666.118159,1.124169
98,,Atacama,Chile,2023-01-02 04:20:57,-27.56610,-70.050300,112012,699,,,"Atacama, Chile",38765.989257,0.624040
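
The cell above previews the rows; Q1 also asks for the column names and a `.describe()` summary. A minimal sketch of those two calls follows, run on a three-row inline sample (an assumption for illustration, reusing a few values from the preview) rather than on the assignment file.

```python
import io

import pandas as pd

# Hypothetical three-row sample with a few of the real column names
csv_text = """Country_Region,Confirmed,Deaths
Afghanistan,207616,7849
Albania,333811,3595
Algeria,271229,6881
"""
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.columns.tolist())  # column names
summary = sample.describe()     # count, mean, std, min, quartiles, max of numeric columns
print(summary)
```

By default `.describe()` summarizes only the numeric columns, so `Country_Region` does not appear in the summary.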


**Question 2  (4 points)**

The data contains a few errors that need to be resolved:


- the ```Long``` column is mistakenly encoded as ```Long_```
- the ```Active``` column and the ```Recovered``` column contain mostly missing values and need to be deleted
- the ```Case_Fatality_Ratio``` column should be rounded to a floating-point number with 4 decimal places.

**In the following questions, please work on the updated dataframe.**

In [92]:
### Q2
df.rename(columns={'Long_': 'Long'}, inplace=True)

df.drop(["Recovered","Active"], axis=1, inplace=True)

df['Case_Fatality_Ratio'] = df['Case_Fatality_Ratio'].round(4)



In [93]:
df

Unnamed: 0,Admin2,Province_State,Country_Region,Last_Update,Lat,Long,Confirmed,Deaths,Combined_Key,Incident_Rate,Case_Fatality_Ratio
0,,,Afghanistan,2023-01-02 04:20:57,33.939110,67.709953,207616,7849,Afghanistan,533.328662,3.7805
1,,,Albania,2023-01-02 04:20:57,41.153300,20.168300,333811,3595,Albania,11599.520467,1.0770
2,,,Algeria,2023-01-02 04:20:57,28.033900,1.659600,271229,6881,Algeria,618.523486,2.5370
3,,,Andorra,2023-01-02 04:20:57,42.506300,1.521800,47751,165,Andorra,61801.591924,0.3455
4,,,Angola,2023-01-02 04:20:57,-11.202700,17.873900,105095,1930,Angola,319.765542,1.8364
...,...,...,...,...,...,...,...,...,...,...,...
4011,,,West Bank and Gaza,2023-01-02 04:20:57,31.952200,35.233200,703228,5708,West Bank and Gaza,13784.956961,0.8117
4012,,,Winter Olympics 2022,2023-01-02 04:20:57,39.904200,116.407400,535,0,Winter Olympics 2022,,0.0000
4013,,,Yemen,2023-01-02 04:20:57,15.552727,48.516388,11945,2159,Yemen,40.048994,18.0745
4014,,,Zambia,2023-01-02 04:20:57,-13.133897,27.849332,334629,4024,Zambia,1820.223025,1.2025


**Grading: students lose 1 point for each incorrect command. The updated question contains a typo: if students also updated the "Incident_Rate" column (which was required in the previous version of the assignment), this does not affect the subsequent questions or the grading.**

**Question 3  (2 points)**

The column ```Last_Update``` contains some timestamps that are not in the year 2023. Find these rows and delete them.

**The updated dataframe should contain only rows with a timestamp in 2023. Please work on this updated dataframe in the following questions.**

Hint: use value_counts() to count unique values first.

In [94]:
### Q3
df.Last_Update.value_counts()

Last_Update
2023-01-02 04:20:57    4002
2020-12-21 13:27:30       5
2022-11-22 23:21:06       2
2020-08-04 02:27:56       2
2022-10-21 23:21:56       1
2022-09-12 23:21:04       1
2020-08-07 22:34:20       1
2021-10-10 23:21:42       1
2021-07-31 23:21:38       1
Name: count, dtype: int64

In [95]:
Last_Update_2023 = df.Last_Update.value_counts().index[0]
Last_Update_2023

'2023-01-02 04:20:57'

In [96]:
df45 = df[df["Last_Update"] == Last_Update_2023]
df45["Last_Update"].value_counts()

Last_Update
2023-01-02 04:20:57    4002
Name: count, dtype: int64

In [97]:
mask = df["Last_Update"] == Last_Update_2023
df = df[mask]

In [98]:
df.Last_Update.value_counts()

Last_Update
2023-01-02 04:20:57    4002
Name: count, dtype: int64
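
The solution above keeps the rows that equal the single most frequent timestamp, which works here because every 2023 row shares it. A more general alternative, sketched below on a hypothetical four-row frame that reuses timestamps from the `value_counts()` output, is to parse the column and filter on the year.

```python
import pandas as pd

# Hypothetical frame reusing timestamps seen in the value_counts() output
toy = pd.DataFrame({
    "Country_Region": ["A", "B", "C", "D"],
    "Last_Update": [
        "2023-01-02 04:20:57",
        "2020-12-21 13:27:30",
        "2022-11-22 23:21:06",
        "2023-01-02 04:20:57",
    ],
})

# Parse the strings, then keep only rows whose year is 2023
years = pd.to_datetime(toy["Last_Update"]).dt.year
toy_2023 = toy[years == 2023]
print(toy_2023)
```

This version would still work if the 2023 rows carried several different timestamps.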

**Question 4  (2 points)**

There are two provinces/states that have the same latitude (```Lat```) 52.939900. Print out these two provinces/states.

In [99]:
Lat = df["Lat"].value_counts().index[0]

In [100]:
df[df["Lat"]==Lat]["Province_State"].values

array(['Quebec', 'Saskatchewan'], dtype=object)

In [101]:
### Q4
df[df.Lat== 52.939900].Province_State

89          Quebec
91    Saskatchewan
Name: Province_State, dtype: object
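
Exact equality on floats works here because the stored value parses to the same number as the literal. When that is not guaranteed, a tolerance-based comparison is a safer habit. A minimal sketch with `np.isclose` on a hypothetical three-row frame (the third row is made up for contrast):

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the two 52.9399 latitudes mirror the Q4 output above
toy = pd.DataFrame({
    "Province_State": ["Quebec", "Saskatchewan", "Ontario"],
    "Lat": [52.9399, 52.9399, 51.2538],
})

# Compare with a tolerance rather than exact float equality
same_lat = toy.loc[np.isclose(toy["Lat"], 52.9399), "Province_State"]
print(same_lat.tolist())
```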

**Question 5  (2 points)**

Calculate and display the average ```Confirmed``` cases across all regions (report one overall average).

Calculate and display the median Deaths for U.S. counties (report one overall median).

In [102]:
### Q5
df['Confirmed'].mean()

165106.29985007495

In [103]:
df[df["Country_Region"] == "US"]["Deaths"].median()

103.0

In [104]:
df.loc[df['Country_Region']=='US', 'Deaths'].median()

103.0

**Question 6 (2 points)**

Show the difference in the average ```Deaths``` between Alabama (US) and Wyoming (US).

In [105]:
### Q6
df[(df['Country_Region']=='US') & (df['Province_State']=='Alabama')].Deaths.mean() - df[(df['Country_Region']=='US') & (df['Province_State']=='Wyoming')].Deaths.mean()

227.92412935323387

In [106]:
al = df[(df["Country_Region"].str.contains("US")) & df["Province_State"].str.contains("Alabama")]["Deaths"].mean()
al

309.5074626865672

In [107]:
wy = df[(df["Country_Region"].str.contains("US")) & df["Province_State"].str.contains("Wyoming")]["Deaths"].mean()
wy

81.58333333333333

In [108]:
al-wy

227.92412935323387
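
The same difference can also be obtained from a single `groupby`, which computes the per-state means once. The sketch below uses four county rows copied from the tables in this notebook, so its result is not the full-data value shown above.

```python
import pandas as pd

# Four county rows only (two per state), for illustration
toy = pd.DataFrame({
    "Country_Region": ["US", "US", "US", "US"],
    "Province_State": ["Alabama", "Alabama", "Wyoming", "Wyoming"],
    "Deaths": [230, 719, 136, 16],
})

# Mean Deaths per state, restricted to US rows
us_rows = toy[toy["Country_Region"] == "US"]
state_means = us_rows.groupby("Province_State")["Deaths"].mean()

diff = state_means["Alabama"] - state_means["Wyoming"]
print(diff)
```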

**Question 7 (4 points)**

Create a subset of the DataFrame containing only samples collected in the U.S. where ```Admin2``` is not "Unassigned" and ```Confirmed``` is a positive number.

Extract the State Name: Using the values in the Combined_Key column, create a new column called State_recovered that contains only the name of the province, state, or dependency. Exclude any county names and country/region information. For example, when Combined_Key is "Autauga, Alabama, US", the extracted state name should be "Alabama". You cannot just copy the value from the column Province_State. 

**Note: In the following questions, use this U.S. subset DataFrame.**

In [123]:
df12 = df[(df.Country_Region=='US') & (df.Admin2 != 'Unassigned') & (df.Confirmed > 0)].copy()
names = df12["Combined_Key"].apply(lambda x: str(x).split(",")[-2].strip())
df12["State_recovered"] = names

In [110]:
### Q7
df1 = df[(df.Country_Region=='US') & (df.Admin2 != 'Unassigned') & (df.Confirmed > 0)].copy()
df1['State_recovered'] = [','.join(x[-2:-1]).strip() for x in df1.Combined_Key.str.split(',')]
df1

Unnamed: 0,Admin2,Province_State,Country_Region,Last_Update,Lat,Long,Confirmed,Deaths,Combined_Key,Incident_Rate,Case_Fatality_Ratio,State_recovered
678,Autauga,Alabama,US,2023-01-02 04:20:57,32.539527,-86.644082,18961,230,"Autauga, Alabama, US",33938.319999,1.2130,Alabama
679,Baldwin,Alabama,US,2023-01-02 04:20:57,30.727750,-87.722071,67496,719,"Baldwin, Alabama, US",30235.537597,1.0652,Alabama
680,Barbour,Alabama,US,2023-01-02 04:20:57,31.868263,-85.387129,7027,103,"Barbour, Alabama, US",28465.527019,1.4658,Alabama
681,Bibb,Alabama,US,2023-01-02 04:20:57,32.996421,-87.125115,7692,108,"Bibb, Alabama, US",34348.486202,1.4041,Alabama
682,Blount,Alabama,US,2023-01-02 04:20:57,33.982109,-86.567906,17731,260,"Blount, Alabama, US",30662.677688,1.4664,Alabama
...,...,...,...,...,...,...,...,...,...,...,...,...
3951,Sweetwater,Wyoming,US,2023-01-02 04:20:57,41.659439,-108.882788,12410,136,"Sweetwater, Wyoming, US",29308.268191,1.0959,Wyoming
3952,Teton,Wyoming,US,2023-01-02 04:20:57,43.935225,-110.589080,12010,16,"Teton, Wyoming, US",51184.793727,0.1332,Wyoming
3953,Uinta,Wyoming,US,2023-01-02 04:20:57,41.287818,-110.547578,6305,43,"Uinta, Wyoming, US",31172.747948,0.6820,Wyoming
3955,Washakie,Wyoming,US,2023-01-02 04:20:57,43.904516,-107.680187,2722,47,"Washakie, Wyoming, US",34875.080077,1.7267,Wyoming


In [124]:
(df12["State_recovered"] == df1["State_recovered"]).value_counts()

State_recovered
True    3213
Name: count, dtype: int64

In [111]:
df1.shape

(3213, 12)
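
Both solutions above split `Combined_Key` in Python-level loops. A vectorized variant using the `.str` accessor is sketched below on two keys taken from the displayed rows; it assumes every key has the form "county, state, country".

```python
import pandas as pd

keys = pd.Series(["Autauga, Alabama, US", "Teton, Wyoming, US"])

# Split on commas, take the second-to-last piece, trim surrounding whitespace
state = keys.str.split(",").str[-2].str.strip()
print(state.tolist())
```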

**Grading: if a dataframe with a different number of rows was produced, the student could get partial credit, depending on whether the solution makes some sense.**

**Question 8 (2 points)**

Compute the correlation between ```Confirmed, Deaths, Incident_Rate, Case_Fatality_Ratio```. What do you observe?

In [112]:
### Q8
df1[['Confirmed','Deaths','Incident_Rate', 'Case_Fatality_Ratio']].corr()

Unnamed: 0,Confirmed,Deaths,Incident_Rate,Case_Fatality_Ratio
Confirmed,1.0,0.967878,0.053755,-0.13491
Deaths,0.967878,1.0,0.048418,-0.07078
Incident_Rate,0.053755,0.048418,1.0,-0.256702
Case_Fatality_Ratio,-0.13491,-0.07078,-0.256702,1.0


**Grading: as long as the student uses the corr() function with the four variables, the solution is correct.**

**Question 10 (2 points)**

Create a new column ```Case_Fatality_Ratio_short``` to extract and store (**not round**) the first three digits of the original values in the column ```Case_Fatality_Ratio```. For example, when the original value is 1.213016, we should extract 1.21.

Create a new column ```Case_Fatality_Ratio_calculated``` and compute the Case-Fatality Ratio (%) yourself. Extract and store the first three digits of the computed values as well.

Note that Case-Fatality Ratio(%) = 100 * Number recorded deaths / Number cases.

In [141]:
short = df12["Case_Fatality_Ratio"].apply(lambda x: float(str(x)[:4]))
short

678     1.21
679     1.06
680     1.46
681     1.40
682     1.46
        ... 
3951    1.09
3952    0.13
3953    0.68
3955    1.72
3956    1.17
Name: Case_Fatality_Ratio, Length: 3213, dtype: float64

In [157]:
cacl = (df12["Deaths"] / df12["Confirmed"]) * 100
cacl.apply(lambda x: float(str(x)[:4])).dropna()

678     1.21
679     1.06
680     1.46
681     1.40
682     1.46
        ... 
3951    1.09
3952    0.13
3953    0.68
3955    1.72
3956    1.17
Length: 3213, dtype: float64

In [144]:
(cacl == short).value_counts()

False    3114
True       99
Name: count, dtype: int64

In [155]:
### Q10
df1['Case_Fatality_Ratio_short'] = df1.Case_Fatality_Ratio.astype(str).str[:4].astype(float)
# Use df1 for both columns (df.Confirmed only worked through index alignment)
df1['Case_Fatality_Ratio_calculated'] = (100 * df1.Deaths/df1.Confirmed).astype(str).str[:4].astype(float)
df1

Unnamed: 0,Admin2,Province_State,Country_Region,Last_Update,Lat,Long,Confirmed,Deaths,Combined_Key,Incident_Rate,Case_Fatality_Ratio,State_recovered,Case_Fatality_Ratio_short,Case_Fatality_Ratio_calculated
678,Autauga,Alabama,US,2023-01-02 04:20:57,32.539527,-86.644082,18961,230,"Autauga, Alabama, US",33938.319999,1.2130,Alabama,1.21,1.21
679,Baldwin,Alabama,US,2023-01-02 04:20:57,30.727750,-87.722071,67496,719,"Baldwin, Alabama, US",30235.537597,1.0652,Alabama,1.06,1.06
680,Barbour,Alabama,US,2023-01-02 04:20:57,31.868263,-85.387129,7027,103,"Barbour, Alabama, US",28465.527019,1.4658,Alabama,1.46,1.46
681,Bibb,Alabama,US,2023-01-02 04:20:57,32.996421,-87.125115,7692,108,"Bibb, Alabama, US",34348.486202,1.4041,Alabama,1.40,1.40
682,Blount,Alabama,US,2023-01-02 04:20:57,33.982109,-86.567906,17731,260,"Blount, Alabama, US",30662.677688,1.4664,Alabama,1.46,1.46
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3951,Sweetwater,Wyoming,US,2023-01-02 04:20:57,41.659439,-108.882788,12410,136,"Sweetwater, Wyoming, US",29308.268191,1.0959,Wyoming,1.09,1.09
3952,Teton,Wyoming,US,2023-01-02 04:20:57,43.935225,-110.589080,12010,16,"Teton, Wyoming, US",51184.793727,0.1332,Wyoming,0.13,0.13
3953,Uinta,Wyoming,US,2023-01-02 04:20:57,41.287818,-110.547578,6305,43,"Uinta, Wyoming, US",31172.747948,0.6820,Wyoming,0.68,0.68
3955,Washakie,Wyoming,US,2023-01-02 04:20:57,43.904516,-107.680187,2722,47,"Washakie, Wyoming, US",34875.080077,1.7267,Wyoming,1.72,1.72


In [147]:
(short == df1['Case_Fatality_Ratio_short']).value_counts()

True    3213
Name: count, dtype: int64

In [158]:
(cacl == df1['Case_Fatality_Ratio_calculated']).value_counts()

False    3114
True       99
Name: count, dtype: int64
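
The string-slicing approach used above keeps the first four characters, which yields two decimal places only when there is a single digit before the decimal point. As a hedged alternative, the sketch below truncates numerically with `np.trunc`; the third value is hypothetical and is there to show where the two approaches diverge.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.213016, 0.1332, 12.3456])

# String slicing keeps 4 characters, so it assumes one digit before the decimal point
sliced = values.astype(str).str[:4].astype(float)

# Numeric truncation toward zero at two decimal places
truncated = np.trunc(values * 100) / 100

print(pd.DataFrame({"value": values, "sliced": sliced, "truncated": truncated}))
```

Numeric truncation has its own caveat: because of binary floating point, a value that sits exactly on a two-decimal boundary can occasionally land one step low, so neither approach is a universal answer.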

**Question 11 (2 points)**

Find and report the number of samples for which ```Case_Fatality_Ratio_short``` is not equal to ```Case_Fatality_Ratio_calculated```, and save them into a new dataframe. Remember to drop the missing values that appear in these two columns before finding the subsample.

**In the following questions, use the new dataframe.**

In [114]:
### Q11
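# .copy() makes df2 independent of df1, so adding columns later does not trigger SettingWithCopyWarning
df2 = df1[df1.Case_Fatality_Ratio_short != df1.Case_Fatality_Ratio_calculated].dropna(subset=['Case_Fatality_Ratio_short', 'Case_Fatality_Ratio_calculated']).copy()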

In [115]:
df2.shape

(209, 14)
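
Dropping missing values matters here because `NaN != NaN` evaluates to `True`, so rows with missing values in both columns would otherwise be counted as mismatches. A minimal sketch on a hypothetical three-row frame (the column names are shortened for the illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "short": [1.21, 0.86, np.nan],
    "calculated": [1.21, 0.85, np.nan],
})

# NaN != NaN is True, so the last row passes the inequality filter
mismatch_raw = toy[toy["short"] != toy["calculated"]]

# Dropping missing values leaves only the genuine mismatch
mismatch = mismatch_raw.dropna(subset=["short", "calculated"])
print(len(mismatch_raw), len(mismatch))
```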

**Grading: If students did not use the updated DataFrame in Q12 and Q13, they may still arrive at the correct value in the "Reject" category. In such cases, partial credit will be awarded, with a 2-point deduction.**

**Question 12 (4 points)**

We define a new concept, acceptable percentage error, to measure the magnitude of the error. It is calculated as the absolute percentage difference between ```Case_Fatality_Ratio_calculated``` and ```Case_Fatality_Ratio_short```, using the formula:

$$\text{acceptable percentage error} = 100 \times \left| \frac{\text{Case\_Fatality\_Ratio\_short} - \text{Case\_Fatality\_Ratio\_calculated}}{\text{Case\_Fatality\_Ratio\_calculated}} \right|$$

Compute the acceptable percentage error (rounded to three decimal places) and add it as a new column to the dataframe.

Group this continuous acceptable percentage error into the following discrete bins to generate a new categorical object: [0, 0.5], (0.5, 1], (1, 10], and (10, 50]. Note that 0 is included in the first bin.

Use the value_counts() method to check the distribution of samples across these bins.


In [116]:
### Q12
df2['acceptable_percentage_error'] = 100 * np.abs((df2.Case_Fatality_Ratio_short - df2.Case_Fatality_Ratio_calculated)/ df2.Case_Fatality_Ratio_calculated)
df2['acceptable_percentage_error'] = df2['acceptable_percentage_error'].round(3)
df2

Unnamed: 0,Admin2,Province_State,Country_Region,Last_Update,Lat,Long,Confirmed,Deaths,Combined_Key,Incident_Rate,Case_Fatality_Ratio,State_recovered,Case_Fatality_Ratio_short,Case_Fatality_Ratio_calculated,acceptable_percentage_error
743,Walker,Alabama,US,2023-01-02 04:20:57,33.802705,-87.300272,22885,476,"Walker, Alabama, US",36027.455487,2.0800,Alabama,2.08,2.07,0.483
772,Southeast Fairbanks,Alaska,US,2023-01-02 04:20:57,63.876921,-143.212764,2442,21,"Southeast Fairbanks, Alaska, US",35427.245031,0.8600,Alaska,0.86,0.85,1.176
875,Butte,California,US,2023-01-02 04:20:57,39.667278,-121.600525,51007,474,"Butte, California, US",23183.050012,0.9328,California,0.93,0.92,1.087
896,Modoc,California,US,2023-01-02 04:20:57,41.589656,-120.724482,1295,11,"Modoc, California, US",14636.353354,0.8501,California,0.85,0.84,1.190
900,Nevada,California,US,2023-01-02 04:20:57,39.303948,-120.762728,22004,132,"Nevada, California, US",22034.985715,0.6005,California,0.60,0.59,1.695
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3343,Williamson,Tennessee,US,2023-01-02 04:20:57,35.890992,-86.892819,74468,438,"Williamson, Tennessee, US",31090.716910,0.5909,Tennessee,0.59,0.58,1.724
3507,Medina,Texas,US,2023-01-02 04:20:57,29.355730,-99.110303,11932,210,"Medina, Texas, US",23131.203474,1.7600,Texas,1.76,1.75,0.571
3522,Nueces,Texas,US,2023-01-02 04:20:57,27.736286,-97.543329,100753,1336,"Nueces, Texas, US",32533.798517,1.3413,Texas,1.34,1.32,1.515
3528,Parker,Texas,US,2023-01-02 04:20:57,32.777572,-97.805006,39587,479,"Parker, Texas, US",27706.854799,1.2100,Texas,1.21,1.20,0.833


In [117]:
df2['acceptable_percentage_error'].describe()

count    209.000000
mean       1.692096
std        3.735055
min        0.433000
25%        0.662000
50%        0.820000
75%        1.176000
max       29.565000
Name: acceptable_percentage_error, dtype: float64

In [161]:
bins = [0, 0.5, 1, 10, 50]
labels = ('[0, 0.5]', '(0.5, 1]', '(1, 10]', '(10, 50]')
df2['acceptable_percentage_error_bins'] = pd.cut(df2['acceptable_percentage_error'], bins=bins, labels=labels, include_lowest=True)

df2['acceptable_percentage_error_bins'].value_counts()

acceptable_percentage_error_bins
(0.5, 1]    131
(1, 10]      62
[0, 0.5]     10
(10, 50]      6
Name: count, dtype: int64
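
To make the binning rule concrete, the sketch below recomputes the error for three (short, calculated) pairs taken from the rows displayed above, plus one hypothetical exact match, and shows what `include_lowest=True` changes. The shortened column names are assumptions for the illustration.

```python
import numpy as np
import pandas as pd

# Three pairs from the displayed rows, plus a hypothetical exact match (error 0)
toy = pd.DataFrame({
    "short": [2.08, 0.86, 1.21, 1.00],
    "calculated": [2.07, 0.85, 1.20, 1.00],
})
toy["error"] = (100 * np.abs((toy["short"] - toy["calculated"]) / toy["calculated"])).round(3)

bins = [0, 0.5, 1, 10, 50]
labels = ["[0, 0.5]", "(0.5, 1]", "(1, 10]", "(10, 50]"]

# include_lowest=True closes the first interval on the left, so an error of 0 is binned
toy["bin"] = pd.cut(toy["error"], bins=bins, labels=labels, include_lowest=True)

# Without it, the first interval is (0, 0.5] and an error of 0 becomes NaN
no_lowest = pd.cut(toy["error"], bins=bins, labels=labels)
print(toy)
```

In the assignment data the minimum error is 0.433, so the flag does not change the counts there; it only guards the edge case named in the question.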

In [119]:
df2

Unnamed: 0,Admin2,Province_State,Country_Region,Last_Update,Lat,Long,Confirmed,Deaths,Combined_Key,Incident_Rate,Case_Fatality_Ratio,State_recovered,Case_Fatality_Ratio_short,Case_Fatality_Ratio_calculated,acceptable_percentage_error,acceptable_percentage_error_bins
743,Walker,Alabama,US,2023-01-02 04:20:57,33.802705,-87.300272,22885,476,"Walker, Alabama, US",36027.455487,2.0800,Alabama,2.08,2.07,0.483,"[0, 0.5]"
772,Southeast Fairbanks,Alaska,US,2023-01-02 04:20:57,63.876921,-143.212764,2442,21,"Southeast Fairbanks, Alaska, US",35427.245031,0.8600,Alaska,0.86,0.85,1.176,"(1, 10]"
875,Butte,California,US,2023-01-02 04:20:57,39.667278,-121.600525,51007,474,"Butte, California, US",23183.050012,0.9328,California,0.93,0.92,1.087,"(1, 10]"
896,Modoc,California,US,2023-01-02 04:20:57,41.589656,-120.724482,1295,11,"Modoc, California, US",14636.353354,0.8501,California,0.85,0.84,1.190,"(1, 10]"
900,Nevada,California,US,2023-01-02 04:20:57,39.303948,-120.762728,22004,132,"Nevada, California, US",22034.985715,0.6005,California,0.60,0.59,1.695,"(1, 10]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3343,Williamson,Tennessee,US,2023-01-02 04:20:57,35.890992,-86.892819,74468,438,"Williamson, Tennessee, US",31090.716910,0.5909,Tennessee,0.59,0.58,1.724,"(1, 10]"
3507,Medina,Texas,US,2023-01-02 04:20:57,29.355730,-99.110303,11932,210,"Medina, Texas, US",23131.203474,1.7600,Texas,1.76,1.75,0.571,"(0.5, 1]"
3522,Nueces,Texas,US,2023-01-02 04:20:57,27.736286,-97.543329,100753,1336,"Nueces, Texas, US",32533.798517,1.3413,Texas,1.34,1.32,1.515,"(1, 10]"
3528,Parker,Texas,US,2023-01-02 04:20:57,32.777572,-97.805006,39587,479,"Parker, Texas, US",27706.854799,1.2100,Texas,1.21,1.20,0.833,"(0.5, 1]"


**Question 13 (2 points)**

Use the ```map()``` method to perform an element-wise transformation on the generated categorical object and create a new series, according to the following rules:

- if error is in range [0, 0.5] or (0.5, 1], transform as 'Accept'
- if error is in range (1, 10] or (10, 50], transform as 'Reject'

Use ```value_counts()``` to check the counts for these two types.

In [162]:
### Q13
error_to_quality = {'[0, 0.5]': 'Accept', '(0.5, 1]': 'Accept', '(1, 10]':'Reject', '(10, 50]':'Reject'}

# Map each bin label to its quality class, then count the two classes
df2['acceptable_percentage_error_bins'].astype(str).map(error_to_quality).value_counts()

acceptable_percentage_error_bins
Accept    141
Reject     68
Name: count, dtype: int64