# Problem 1: Covariance - Josh
- When given a data matrix, an easy way to tell if any two columns are correlated is to look at a scatter plot of each column against each other column.  
As a warm-up, do the following:  
Look at the data in DF1 in Lab2 Data.zip.  
Which columns are (pairwise) correlated?  
Figure out how to do this with Pandas, and also how to do this with Seaborn.  

- Compute the covariance matrix of the data. Write the explicit expression for what this is,  
and then use any command you like (e.g., np.cov) to compute the 4 × 4 matrix.  
Explain why the numbers you get are consistent with the plots.

- The above problem in reverse. Generate a zero-mean multivariate Gaussian random variable in 3 dimensions,  
Z = (X1,X2,X3) so that (X1,X2) and (X1,X3) are uncorrelated, but (X2,X3) are correlated.  
Specifically: choose a covariance matrix that has the above correlation structure, and write it down. Then find a way to generate samples from this Gaussian.  
Choose one of the non-zero covariance terms ($C_{ij}$, where C denotes your covariance matrix) and plot it against the estimated covariance term as the number of samples grows.  
The goal is to get a visual representation of how the empirical covariance converges to the true (or population) covariance.
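
A minimal sketch of the last part, assuming NumPy. The covariance matrix, the value 0.8, the sample sizes, and the helper name `empirical_cov_term` are illustrative choices made here, not part of the assignment. The estimate uses the usual sample covariance $\hat{C} = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \bar{x})(x_k - \bar{x})^\top$, which is what `np.cov` computes by default.

```python
import numpy as np

# Assumed covariance: X1 is uncorrelated with X2 and X3; Cov(X2, X3) = 0.8.
C = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.8],
              [0.0, 0.8, 1.0]])

def empirical_cov_term(n_samples, i=1, j=2, seed=0):
    """Estimate C[i, j] from n_samples draws of a zero-mean Gaussian with covariance C."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(np.zeros(3), C, size=n_samples)
    # rowvar=False: each column is a variable, each row is an observation.
    return np.cov(samples, rowvar=False)[i, j]

sample_sizes = [10, 100, 1_000, 10_000, 100_000]
estimates = [empirical_cov_term(n) for n in sample_sizes]

for n, est in zip(sample_sizes, estimates):
    print(f"n={n:>6}  estimate={est:.4f}  true={C[1, 2]:.1f}")
```

Plotting `estimates` against `sample_sizes` on a logarithmic x-axis, with a horizontal line at the true value, is one way to visualize the convergence.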

# Problem 2: Outliers - Jackson

Consider the two-dimensional data in DF2 in *Lab2 Data.zip*.  
<b>Look at a scatter plot of the data.</b>  

It contains two points that look like potential outliers.  
<b>Which one is “more” outlying?</b>  
Propose a transformation of the data that makes it clear that the point at (−1,1) is more outlying than the point at (5.5,5),   
even though the latter point is “farther away” from the nearest points.  
Plot the data again after performing this transformation.  
Provide discussion as appropriate to justify your choice of transformation.  
<i>Hint: suppose y comes from a standard Gaussian in two dimensions  
(i.e., with covariance equal to the two-by-two identity matrix), and let</i>

$$Q = \begin{bmatrix} 2 & 1/2 \\ 1/2 & 2\end{bmatrix}$$

<i>What is the covariance matrix of the random variable z = Qy? If you are given z, how would you
create a random Gaussian vector with covariance equal to the identity, using z?</i>
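
The hint can be checked numerically. The sketch below assumes NumPy and uses simulated data rather than DF2. It draws standard Gaussian samples, applies Q, and then whitens the result with the inverse symmetric square root of the empirical covariance. This is one possible whitening transform among several (a Cholesky factor would also work), and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = np.array([[2.0, 0.5],
              [0.5, 2.0]])

y = rng.standard_normal((50_000, 2))   # rows are samples with identity covariance
z = y @ Q.T                            # each row is Q @ y_i, so Cov(z) = Q Q^T

cov_z = np.cov(z, rowvar=False)

# Whitening: multiply by the inverse symmetric square root of the covariance.
eigvals, eigvecs = np.linalg.eigh(cov_z)
W = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
z_white = (z - z.mean(axis=0)) @ W.T

print(np.round(cov_z, 2))                             # close to Q @ Q.T
print(np.round(np.cov(z_white, rowvar=False), 2))     # close to the identity
```

After a transformation of this kind, Euclidean distance from the origin accounts for the correlation in the data, which is why it can make an outlier stand out even when it sits near the bulk of the points.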

# Problem 3: Popular Names - Jhanvi

The goal of this exercise is for you to get more experience with Pandas, and to get a chance to  
explore a cool data set. Download the file Names.zip from Canvas. This contains the frequency  
of all names that appeared more than 5 times on a social security application from 1880 through 2015.
- Write a program that on input k and XXXX, returns the top k names from year XXXX.  
- Write a program that on input Name returns the frequency for men and women of the name `Name`.
- It could be that names are more diverse now than they were in 1880, so that a name may be relatively the most popular, though its frequency may have been decreasing over the years.  
Modify the above to return the relative frequency. Note that in the coming lectures we will learn how to quantify diversity using entropy.
- Find all the names that used to be more popular for one gender, but then became more popular for the other gender.
- (Optional) Find something cool about this data set.

import pandas as pd
from zipfile import ZipFile
import decimal as d
import seaborn as sns
import csv

# Globals
file_name = 'Names.zip'
column_names = ['Name', 'Gender', 'Count']
sns.set(style="ticks",
        rc={
            "figure.figsize": [12, 7],
            "text.color": "black",
            "axes.labelcolor": "black",
            "axes.edgecolor": "black",
            "xtick.color": "black",
            "ytick.color": "black",
            "axes.facecolor": "#DAF7A6",
            "figure.facecolor": "#DAF7A6"}
        )

# Functions
def read_file_to_df(name, year):
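      with ZipFile(file_name, 'r') as zipped:
           # The yearly files have no header row, so pass the column names explicitly;
           # otherwise pandas consumes the first record (the most popular name) as the header.
           with zipped.open(name) as file:
                df = pd.read_csv(file, header=None, names=column_names)
           df.insert(0, 'Year', year)
           return df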

def top_names_from_year(top, year):
      year_file_name = 'Names/yob' + str(year) + '.txt'
      df = read_file_to_df(year_file_name, year)
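      # sort_values returns a new frame; sort descending and renumber so the first rows are the top names
      df = df.sort_values('Count', ascending=False).reset_index(drop=True)
      return df.loc[:(top-1), ('Name', 'Gender', 'Count')]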

def name_frequency(name):
      curr_year = 1880
      freq_sum_men = 0
      freq_sum_women = 0
      rel_freq_men = 0
      rel_freq_women = 0
      total = 0

      while curr_year != 2016:
            year_file_name = 'Names/yob' + str(curr_year) + '.txt'
            df = read_file_to_df(year_file_name, curr_year)

            df_men = df.loc[(df.Name == name) & (df.Gender == 'M')]
            df_women = df.loc[(df.Name == name) & (df.Gender == 'F')]

            if len(df_men) != 0:
                  freq_sum_men += df_men.Count.iloc[0]
            if len(df_women) != 0:
                  freq_sum_women += df_women.Count.iloc[0]
            curr_year += 1

      total = freq_sum_men + freq_sum_women

      if total != 0:
            rel_freq_men = (freq_sum_men / total) * 100
            rel_freq_men = d.Decimal(rel_freq_men).quantize(d.Decimal('.1'), rounding=d.ROUND_DOWN)
            rel_freq_women = (freq_sum_women / total) * 100
            rel_freq_women = d.Decimal(rel_freq_women).quantize(d.Decimal('.1'), rounding=d.ROUND_DOWN)

      print(name + " from 1880 to 2015")
      ret_df = pd.DataFrame({'Count': [freq_sum_men, freq_sum_women], 'Relative Frequency': [rel_freq_men, rel_freq_women]}, index=['Men', 'Women'])
      return ret_df

def rel_frequency(name):
      curr_year = 1880
      rows = []
      men_perc = 0
      women_perc = 0

      while curr_year != 2016:
            total = 0
            year_file_name = 'Names/yob' + str(curr_year) + '.txt'
            df = read_file_to_df(year_file_name, curr_year)

            df_men = df.loc[(df.Name == name) & (df.Gender == 'M')]
            df_women = df.loc[(df.Name == name) & (df.Gender == 'F')]
            row = [curr_year, 0, 0, 0, 0]

            if len(df_men) != 0:
                  row[1] = df_men.Count.iloc[0]
                  total += row[1]

            if len(df_women) != 0:
                  row[2] = df_women.Count.iloc[0]
                  total += row[2]

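            # Reset each year so a year without records does not reuse the previous year's percentages
            men_perc = 0
            women_perc = 0
            if total != 0:
                men_perc = (row[1] / total) * 100
                women_perc = (row[2] / total) * 100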

            row[3] = d.Decimal(men_perc).quantize(d.Decimal('.1'), rounding=d.ROUND_DOWN)
            row[4] = d.Decimal(women_perc).quantize(d.Decimal('.1'), rounding=d.ROUND_DOWN)

            rows.append(row)
            curr_year += 1

      ret_df = pd.DataFrame(rows, columns=['Year', 'Men', 'Women', '% Mens', '% Womens'])
      # ret_df = ret_df.set_index('Year')
      return ret_df

def open_all_in_one_df():
      curr_year = 1880
      ret_df = []

      while curr_year != 2016:
            year_file_name = 'Names/yob' + str(curr_year) + '.txt'
            temp_df = read_file_to_df(year_file_name, curr_year)
            ret_df.append(temp_df)
            curr_year += 1

      big_df = pd.concat(ret_df)
      return big_df

# Parts
# Question 1: Top k names in year XXXX
k, year = input("Find the top k names in the year XXXX - k year.").split()
top_names_from_year(int(k), int(year))

# Question 2: Most frequent M & F of Name
name_in = input("Find frequency of name among men and women - name")
name_frequency(name_in)

# Question 3: Relative frequency of a name
name_in_rel = input("Find the relative frequency of a name among men and women - name")
df = rel_frequency(name_in_rel)
df = df.astype(float)
print(name_in_rel + " from 1880 to 2015")
display(df)
line_count = df.plot(x='Year', y=['Men', 'Women'], kind='line', title='Count')
display(line_count)
line_rel = df.plot(x='Year', y=['% Mens', '% Womens'], kind='line', title='Relative Frequency')
display(line_rel)

# Question 4: Name popularity shifts from one gender to the other
print('The following names switched popularity between genders')
gender_switched = []
col = 'Name'
all_names = open_all_in_one_df()
uni_names = all_names[col].unique()
#print(uni_names)

# Note: rel_frequency re-reads every yearly file for each name, so this loop is very slow.
for unique_name in uni_names:
      freq_df = rel_frequency(unique_name)
      # Highest yearly share (in percent) that each gender reached for this name
      max_men_relfreq = pd.to_numeric(freq_df['% Mens']).max()
      max_women_relfreq = pd.to_numeric(freq_df['% Womens']).max()

      # A name "switched" if each gender held the majority in at least one year
      if max_men_relfreq > 50 and max_women_relfreq > 50:
            gender_switched.append(unique_name)
            print(unique_name)

The following names switched popularity between genders
Emma
Jessie
Emily
Ollie
Augusta
Sophia


KeyboardInterrupt: 
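
The loop above calls `rel_frequency` once per unique name, and each call opens every yearly file again, which is a plausible reason the run had to be interrupted. A possible alternative is to work on the single combined frame and let pandas do the comparison in one pass. The sketch below assumes a frame with the same `Year`, `Name`, `Gender`, `Count` columns; the `toy` data and the helper name `switched_names` are made up for illustration and are not the real data set.

```python
import pandas as pd

# Tiny made-up sample in the same Year/Name/Gender/Count layout; not real data.
toy = pd.DataFrame({
    "Year":   [1900, 1900, 1950, 1950, 1900, 1950],
    "Name":   ["Leslie", "Leslie", "Leslie", "Leslie", "Mary", "Mary"],
    "Gender": ["M", "F", "M", "F", "F", "F"],
    "Count":  [90, 10, 20, 80, 500, 400],
})

def switched_names(all_names):
    """Return names with a male majority in some year and a female majority in another."""
    # One row per (Name, Year), one column per gender, missing combinations filled with 0.
    counts = all_names.pivot_table(index=["Name", "Year"], columns="Gender",
                                   values="Count", aggfunc="sum", fill_value=0)
    counts = counts.reindex(columns=["F", "M"], fill_value=0)
    male_majority = (counts["M"] > counts["F"]).groupby(level="Name").any()
    female_majority = (counts["F"] > counts["M"]).groupby(level="Name").any()
    both = male_majority & female_majority
    return sorted(both[both].index)

print(switched_names(toy))
```

Here a name counts as switched under the same rule the loop uses: each gender held more than half of the records in at least one year. On the full data, `switched_names(open_all_in_one_df())` would be the corresponding call.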

# Problem 4: Starting in Kaggle - Josh

Later in this class, you will participate in an in-class Kaggle competition made specifically
for this class, and you will compete on your own. This assignment is a warm-up: the more
effort and research you put into it, the easier it will be to compete in the real Kaggle
competition that you will need to do soon.
1. Let’s start with our first Kaggle submission in a playground regression competition. Make an
account on Kaggle and find
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/
2. Follow the data preprocessing steps from
https://www.kaggle.com/apapiu/house-prices-advanced-regression-techniques/regularized-linear-models
and then run a ridge regression using α = 0.1. Make a submission of this prediction.
What is the RMSE you get?
(Hint: remember to exponentiate your predictions with np.expm1(ypred).)
3. Compare a ridge regression and a lasso regression model. Optimize the alphas using
cross-validation. What is the best score you can get from a single ridge regression model and from
a single lasso model?
4. Plot the l0 norm (number of nonzeros) of the coefficients that lasso produces as you vary the
regularization strength alpha.
5. Add the outputs of your models as features and train a ridge regression on all the features
plus the model outputs (This is called Ensembling and Stacking). Be careful not to overfit.
What score can you get? (We will be discussing ensembling more, later in the class, but you
can start playing with it now).
6. Install XGBoost (Gradient Boosting) and train a gradient boosting regression. What score
can you get just from a single XGB? (you will need to optimize over its parameters). We will
discuss boosting and gradient boosting in more detail later. XGB is a great friend to all good
Kagglers!
7. Do your best to get the most accurate model. Try feature engineering and stacking many
models. You are allowed to use any public tool in python. No non-python tools allowed.
8. (Optional) Read the Kaggle forums, tutorials and Kernels in this competition. This is an
excellent way to learn. Include in your report if you find something in the forums you like, or
if you made your own post or code post, especially if other Kagglers liked or used it afterwards.
9. Be sure to read and learn the rules of Kaggle! No sharing of code or data outside the Kaggle
forums. Every student should have their own individual Kaggle account and teams can be
formed in the Kaggle submissions with partners. This is more important for live competitions
of course.
10. As in the real in-class Kaggle competition (which will be next), you will be graded based on
your public score (include that in your report) and also on the creativity of your solution.
In your report (that you will submit as a pdf file), explain what worked and what did
not work. Many creative things will not work, but you will get partial credit for developing
them. We will invite teams with interesting solutions to present them in class.
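
As a rough illustration of step 4, the sketch below counts the nonzero lasso coefficients for several values of alpha. It assumes scikit-learn is installed and uses a synthetic regression problem from `make_regression` as a stand-in for the preprocessed house-price features. The alpha grid and data dimensions are arbitrary choices, not values from the assignment.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic stand-in for the preprocessed features; not the Kaggle data.
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
nonzeros = []
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=50_000).fit(X, y)
    # The "l0 norm" is the number of coefficients that are not exactly zero.
    nonzeros.append(int(np.count_nonzero(model.coef_)))

for alpha, k in zip(alphas, nonzeros):
    print(f"alpha={alpha:<6} nonzero coefficients={k}")
```

Plotting `nonzeros` against `alphas` on a logarithmic x-axis gives the requested curve. On the real data, the target would typically be log-transformed first, as the hint in step 2 suggests.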