# <span style="color:lightblue"> Lecture 12: Application 2 - Random Assignment </span>

<font size = "5">



# <span style="color:lightblue"> I. Import Libraries and Data </span>


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
carfeatures = pd.read_csv("data_raw/features.csv")

# <span style="color:lightblue"> II. Random Assignment </span>

<font size = "5">

Random assignment is crucial for scientific progress ...

- The basis for medical trials
- Also used in engineering, the natural sciences and <br>
  social sciences (economics, political science, etc.)


In [None]:
# "list_status" is a list with "treatment/control" arms
# "prop_status" is the proportion in the treatment and control arms
# "size_dataset" is how many rows are contained

list_status  = ["Treatment","Control"]
prop_status  = [0.4,0.6]
size_dataset = len(carfeatures)

<font size = "5">
Random assignment


In [None]:
# "np.random.choice" creates a vector with the status of each row
# We save this vector to a new column in "carfeatures"
# Note: We can always split the arguments of a function across multiple lines
#       to make them easier to read

carfeatures["status"] = np.random.choice(list_status,
                                         size = size_dataset,
                                         p = prop_status)

display(carfeatures)
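
Because the assignment is random, rerunning the cell above produces a different "status" column each time. The following is a minimal sketch of one way to make the draw reproducible, assuming NumPy's `default_rng` generator with a fixed seed. The seed value and the small `toy` dataset are hypothetical and are not part of the lecture data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset (stands in for "carfeatures")
toy = pd.DataFrame({"mpg": [18, 15, 36, 26, 30, 22, 14, 31, 24, 19]})

list_status = ["Treatment", "Control"]
prop_status = [0.4, 0.6]

# A generator with a fixed seed (the value 123 is arbitrary)
rng = np.random.default_rng(123)
toy["status"] = rng.choice(list_status, size = len(toy), p = prop_status)

# Re-creating the generator with the same seed reproduces the same draw
rng_again = np.random.default_rng(123)
status_again = rng_again.choice(list_status, size = len(toy), p = prop_status)

print(toy)
print((toy["status"] == status_again).all())
```

Fixing the seed is a convenience for checking your work. The assignment is still random in the sense that it does not depend on any column of the data.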

<font size = "5">

Compute frequencies by status

In [None]:
# The command "pd.crosstab" computes frequencies
# If we add the option "normalize" it will compute proportions
# Note: By default, "np.random.choice" draws each row's status independently
#       (with replacement), which means that the proportions are approximately
#       the same (but not equal) to "prop_status"

frequency_table   = pd.crosstab(index = carfeatures["status"], columns = "Frequency")
proportions_table = pd.crosstab(index = carfeatures["status"],
                                columns = "Frequency",
                                normalize = True)

display(frequency_table)
display(proportions_table)
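
The gap between the realized proportions and "prop_status" tends to shrink as the number of rows grows. The sketch below illustrates this under assumed sample sizes (100 and 100,000 are arbitrary choices) and an arbitrary seed. It uses `value_counts(normalize = True)`, which is an alternative to `pd.crosstab` for a single column.

```python
import numpy as np
import pandas as pd

list_status = ["Treatment", "Control"]
prop_status = [0.4, 0.6]

rng = np.random.default_rng(0)  # arbitrary seed

# Store the realized share of "Treatment" for each sample size
share_treated = {}
for n in [100, 100_000]:
    status = pd.Series(rng.choice(list_status, size = n, p = prop_status))
    proportions = status.value_counts(normalize = True)
    share_treated[n] = proportions["Treatment"]

print(share_treated)
```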


<font size = "5">

Query with string conditions

In [None]:
# When you have queries for text variables, it's important
# to use outer ' ' single quotations
# and inner double quotations.

data_treated = carfeatures.query('status == "Treatment" ')
data_control = carfeatures.query('status == "Control" ')

<font size = "5">

Treated/control should be similar

- This is the key principle of random assignment
- We can check the summary statistics

In [None]:
# The count is different because we assigned different proportions
# All other summary statistics are approximately the same
# They are not identical because the assignment is random

display(data_treated.describe())
display(data_control.describe())
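
A compact way to compare the two groups side by side is to compute the same statistics within each arm. This is only a sketch on simulated data: the column "x", the sample size, and the seed are hypothetical, and `groupby` is used here as an alternative to calling `.describe()` on each subset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed
n = 10_000

# Hypothetical data: one numeric column drawn from a normal distribution
toy = pd.DataFrame({"x": rng.normal(loc = 50, scale = 10, size = n)})
toy["status"] = rng.choice(["Treatment", "Control"], size = n, p = [0.4, 0.6])

# Mean, standard deviation, and count of "x" within each arm
balance = toy.groupby("status")["x"].agg(["mean", "std", "count"])
diff_means = balance.loc["Treatment", "mean"] - balance.loc["Control", "mean"]

print(balance)
print(diff_means)
```

With random assignment, the difference in means should be small relative to the standard deviation of "x", while the counts reflect the 40/60 split.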

# <span style="color:lightblue"> III. Quiz Structure </span>

<font size = "5">

On the day of the quiz I will ...
- Provide a dataset with information
- Give more specific instructions

What to expect:
- Below, you will see the type of questions that will be asked
- The idea is for you to apply known concepts to new data
- You will have 50 minutes to complete the assignment

Questions

(The exact wording may change in the quiz, but the exercises will be very similar.)


<font size = "5">

(a) Create a function and apply it to a column

- Check Lecture 8 for how to define a function
- The function will have if/else statements and output a string
- You will use ".apply()" to create a new variable in the dataset <br>
(see Lecture 9)

In [None]:
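# Create a small example dataset with one column
data = pd.DataFrame([])
data["mpg"] = [4,4,3,6,7,6,7,8]

# The function returns "pass" if x is at least 5 and "fail" otherwise
def linearfn(x):
    if x >= 5:
        status = "pass"
    else:
        status = "fail"
    return status

# Example of a single call: linearfn(x = 60) returns "pass"

# ".apply()" evaluates the function on every element of the column
data["linear"] = data["mpg"].apply(linearfn)

# The same function can be applied to a column of "carfeatures"
carfeatures["linear"] = carfeatures["cylinders"].apply(linearfn)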

<font size = "5">

(b) Use queries + global variables

- You will be asked to compute certain summary statistics <br>
(mean, median, etc)
- The query will have multiple conditions
- Then subset a dataset that meets certain conditions
- See Lecture 10 for more details

In [None]:
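# Similar to Assignment 6 (d)
# "wdi_data" is the name of the dataset (the same file is used in part (c))
wdi_data = pd.read_csv("data/wdi_2020.csv")

# Store the median in a global variable
medianvalue = wdi_data["prop_urbanpopulation"].median()

# ".query()" is used to subset; "@" references the global variable
data_median = wdi_data.query("prop_urbanpopulation > @medianvalue")
display(data_median)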

In [None]:
data_mpg_cylinders = carfeatures.query("(mpg >= 25) & (cylinders == 8)")

#global variable
threshold = 25
data_varthreshold_mpg = carfeatures.query("mpg >= @threshold")
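
The pieces above can be combined: a query with multiple conditions that references global variables, followed by summary statistics on the subset. This is a sketch on a hypothetical toy dataset. The values and the variable names `threshold` and `n_cylinders` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy dataset (not the lecture's "carfeatures")
toy = pd.DataFrame({"mpg":       [18, 15, 36, 26, 30, 22],
                    "cylinders": [8, 8, 4, 4, 4, 6]})

# Global variables referenced inside the query with "@"
threshold = 20
n_cylinders = 4

# Keep rows that satisfy both conditions
subset = toy.query("(mpg >= @threshold) & (cylinders == @n_cylinders)")

# Summary statistics computed on the subset only
mean_mpg = subset["mpg"].mean()
median_mpg = subset["mpg"].median()

print(subset)
print(mean_mpg, median_mpg)
```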

<font size = "5">

(c) Use sorting + ".iloc[]"

- Extract the observations with the largest values of a column
- See Lecture 10 for details

In [None]:
# Read data
wdi_data = pd.read_csv("data/wdi_2020.csv")

datasorted = wdi_data.sort_values(by = "prop_urbanpopulation", ascending = False)

# Extract the first five rows (positions 0 to 4)
display(datasorted.iloc[0:5,:])

# ascending = True puts the smallest values first
# ascending = False (descending) puts the largest values first
# Extract the single row with the largest value
display(datasorted.iloc[0:1,:])

<font size = "5">

(d) Split a dataset into subsets

- You will be asked to randomly assign a status to each row
- Split the data into separate datasets using ".query()"
- This will closely follow the material in Lecture 12 (this one)
- You will need this result to answer questions (e), (f)


In [None]:
list_status  = ["Treatment","Control"]
prop_status  = [0.4,0.6]
size_dataset = len(carfeatures)

carfeatures["status"] = np.random.choice(list_status,
                                         size = size_dataset,
                                         p = prop_status)

data_treated = carfeatures.query('status == "Treatment" ')
data_control = carfeatures.query('status == "Control" ')

display(data_treated)
display(data_control)

<font size = "5">

(e) Create a function with four inputs $f(y,x,b0,b1)$

- Start by using "def" to define the function
- The function will include arithmetic operations (Lecture 3) <br>
and summary statistics for pandas (mean, std, min, max, etc.)
- You will be asked to test different values of $(y,x,b0,b1)$
- You will get $y$ and $x$ from the two datasets in part (d)
- Note: You will **not** be required to use the "statsmodels" library


In [None]:
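# Arithmetic operations on the inputs, followed by a summary statistic
def arithmeticfn(y, x, b0, b1):
    F = y / x * b0 * b1
    return np.median(F)

# y and x are columns of the dataset; b0 and b1 are numbers
F1 = arithmeticfn(y = carfeatures["mpg"], x = carfeatures["cylinders"], b0 = 3, b1 = 4)

print(F1)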
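
To practice testing different values of $(y,x,b0,b1)$, here is another sketch of a four-input function that mixes arithmetic with a pandas summary statistic. The function `mean_squared_gap` and the toy data are hypothetical. The actual formula in the quiz may be different.

```python
import pandas as pd

def mean_squared_gap(y, x, b0, b1):
    # Arithmetic on whole columns, then a summary statistic (the mean)
    gap = y - (b0 + b1 * x)
    return (gap ** 2).mean()

# Hypothetical toy data where y = 1 + 2x exactly
toy = pd.DataFrame({"y": [3, 5, 7, 9], "x": [1, 2, 3, 4]})

# The gap is zero in every row when (b0, b1) = (1, 2)
value_a = mean_squared_gap(y = toy["y"], x = toy["x"], b0 = 1, b1 = 2)

# The gap is one in every row when (b0, b1) = (0, 2)
value_b = mean_squared_gap(y = toy["y"], x = toy["x"], b0 = 0, b1 = 2)

print(value_a, value_b)
```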

<font size = "5">

(f) Create two overlapping histogram plots

- You will use a variable from the two datasets in (d)
- You need to use the "alpha" option to make the graphs semitransparent
- You will need to add a legend, label the axes, and the title
- Note: The goal of this question is to illustrate that random <br>
assignment produces very similar distributions between two groups

In [None]:
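car_unique = pd.unique(carfeatures["status"])

# Draw one semitransparent histogram per status on the same axes
for value in car_unique:
    df = carfeatures.query("status == @value")
    plt.hist(df["mpg"], alpha = 0.5)

# "mpg" is on the x-axis; the y-axis counts observations
plt.title("Distribution of mpg by status")
plt.xlabel("mpg")
plt.ylabel("Frequency")
plt.legend(labels = car_unique)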