<font color=red>For the submission of this assignment, you should create the following folder structure: "FIRSTNAME_LASTNAME/hw01/hw01.ipynb". Then, zip your files and submit on google classroom.</font> 

<img src='folder.png'/>

<font color=red>!!! PLEASE DON'T DELETE ANY EMPTY CELLS !!!</font>


# Linear Regression

In this homework we are going to apply linear regression to two different problems. We'll begin by guiding you through predicting job satisfaction and the desire to be a manager among developers based on survey data. Once that's done, you will model candy preference based on composition and food science properties.

In [1]:
!pip install -r requirements.txt

Obtaining jupyter-testing from git+https://github.com/gauravmm/jupyter-testing.git#egg=jupyter-testing (from -r requirements.txt (line 3))
  Updating c:\users\mofeoluwa\src\jupyter-testing clone
Installing collected packages: jupyter-testing
  Found existing installation: jupyter-testing 0.0.2
    Uninstalling jupyter-testing-0.0.2:
      Successfully uninstalled jupyter-testing-0.0.2
  Running setup.py develop for jupyter-testing
Successfully installed jupyter-testing


  Running command git fetch -q --tags
  Running command git reset --hard -q 8c6b703e663a16f77af53a05096f0258274656b2


In [2]:
import csv
import gzip
import math
import hashlib
import numpy as np
from pprint import pprint

from testing.testing import test

## Developers! Developers! Developers!

The data for this question is based on the [2019 StackOverflow Survey](https://insights.stackoverflow.com/survey/2019); accordingly, the subset bundled with this assignment is also released under the Open Database License (ODbL) v1.0.

The data was made by selecting some columns from the original dataset, retaining only rows from people who described themselves as "a developer by profession", and replacing long responses with shorter strings. Let's begin by examining the data.

In [3]:
def read_csv_test(read_csv):
    headers, rows = read_csv()
    test.equal(len(rows), 65679)
    test.equal(len(headers), 26)

    # Print a row:
    pprint(dict(zip(headers, rows[0])))
    
@test
def read_csv(fn="eggs.csv.gz"):
    """read the GZipped CSV data and split it into headers and body rows.
    
    kwargs:
        fn : str -- .csv.gz file to read
    
    returns: Tuple[headers, body] where
      headers : Tuple[str] -- the CSV headers
      body : List[Tuple[str,...]] -- the CSV body
    """
    with gzip.open(fn, 'rt', newline="", encoding='utf-8') as f:
        csvobj = csv.reader(f)
        headers = next(csvobj)
        return headers, [tuple(row) for row in csvobj]

{'Age': '22',
 'CareerSat': 'vs',
 'CodeRevHrs': 'NA',
 'ConvertedComp': '61000',
 'Country': 'United States',
 'Dependents': 'n',
 'DevEnvironVSC': 'y',
 'DevTypeFullStack': 'n',
 'EdLevel': 'bachelors',
 'EduOtherMOOC': 'y',
 'EduOtherSelf': 'y',
 'Extraversion': 'y',
 'GenderIsMan': 'y',
 'Hobbyist': 'n',
 'MgrIdiot': 'very',
 'MgrWant': 'n',
 'OpSys': 'win',
 'OpenSourcer': 'never',
 'OrgSize': '100-499',
 'Respondent': '4',
 'Student': 'n',
 'UndergradMajorIsComputerScience': 'y',
 'UnitTestsProcess': 'n',
 'WorkWeekHrs': '80',
 'YearsCode': '3',
 'YearsCodePro': '0'}
### TESTING read_csv: PASSED 2/2
###



Our task is to predict:

1. if respondents are managers or want to be a manager in the future (`MgrWant` is `y`), and
2. if respondents are satisfied with their career (`CareerSat` is `vs` or `ss`)

based on the remaining columns. We have bolded the two target columns in the table below.

Before we can use linear regression, we must convert the survey responses into numeric data. This is the core challenge of this problem. Here's a table of the columns and what they mean:


| Column | Sample | Does/is the respondent... | Type/Values |
| --- |:--- |:--- |:--- |
| **CareerSat** | 'vs' | satisfied with their career? | (`vd`, `sd`, `ne`, `NA`, `ss`, `vs`) -- corresponding to ({very, slightly}, {satisfied, dissatisfied}), neutral, and not applicable |
| **MgrWant** | 'n' | ...want to be a manager? | boolean |
| Age    | '22'   | age | integer     |
| CodeRevHrs | '2' | hours a week spent reviewing code | integer |
| ConvertedComp | '61000' | yearly compensation in 2019 USD | integer |
| Country | 'United States' | lives in country | string _(ignore in regression)_ |
| Dependents | 'n' | ...have children or other dependents. | boolean |
| DevEnvironVSC | 'y' | ...use Visual Studio Code | boolean |
| DevTypeFullStack | 'n' | ...identify as a full-stack developer | boolean |
| EdLevel | 'bachelors' | maximum education level | (`other`, `bachelors`, `masters`, `doctoral`) |
| EduOtherMOOC | 'y' | ...ever taken a Massively Open Online Course | boolean |
| EduOtherSelf | 'y' | ...ever taught themselves a new platform | boolean |
| Extraversion | 'y' | ...prefer in-person meetings to online meetings | boolean |
| GenderIsMan | 'y' | ...male | boolean |
| Hobbyist | 'n' | ...write code as a hobby? | boolean |
| MgrIdiot | 'very' | ...think their manager knows what they are doing? | (`NA`, `not`, `some`, `very`), in order of increasing confidence |
| OpSys | 'win' | which OS do they use? | (`win`, `mac`, `tux`, `NA`), for (Windows, Mac OSX, Linux-like, NA) |
| OpenSourcer | 'never' | ...contribute to open-source projects? | (`never`, `year`, `month-year`, `month`), in increasing order of frequency |
| OrgSize | '100-499' | number of employees in organization? | (`NA`, `1`, `2-9`, `10-19`, `20-99`, `100-499`, `500-999`, `1,000-4,999`, `5,000-9,999`, `10,000+`) |
| Respondent | '4' | respondent ID from original data | integer _(ignore in regression)_ |
| Student | 'n' | ...currently a student? | boolean |
| UndergradMajorIsComputerScience | 'y' | ...majored in CS? | boolean |
| UnitTestsProcess | 'n' | ...use unit tests in their job? | boolean |
| WorkWeekHrs | '80' | hours a week worked | integer |
| YearsCode | '3' | years since first programming | integer |
| YearsCodePro | '0' | years programming professionally | integer |

### Type conversion

Now for the slow data-cleaning grind that is characteristic of work as a data scientist. We begin by writing type coercion functions: functions that convert each column type into a `float` value for use in linear regression. All input values are `str`.

The column types are:

 - _boolean_ : `y`/`NA`/`n` assigned to `+1.0`/`0.0`/`0.0`
 - _integer_ : convert to `float`, preserving value. `NA` equals `0.0`. 
 - _string_ : not included in regression; we'll use it later
 - CareerSat: Map (`vd`, `sd`, `ne`, `NA`, `ss`, `vs`) to (-2.0, -1.0, 0.0, 0.0, 1.0, 2.0)
 - EdLevel: Map (`other`, `bachelors`, `masters`, `doctoral`) to (0.0, 1.0, 1.5, 2.0)
 - MgrIdiot: Map (`NA`, `not`, `some`, `very`) to (-1.0, -1.0, 0.0, 1.0)
 - OpSys: This is a categorical variable; we will split it into three columns (one for each possible value) and set 1.0 in the column that matches the value. This is called a [one-hot encoding](https://en.wikipedia.org/wiki/One-hot). (Don't write a conversion function in this step.)
 - OpenSourcer : Map (`never`, `year`, `month-year`, `month`) to (0.0, 0.5, 1.0, 2.0)
 - OrgSize: Map each range "$a$-$b$" to the value $\ln(a)$. Treat `NA` as $\ln(1.0) = 0$. We are converting an exponentially distributed range to a linearly distributed one.
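
As a quick illustration of why the logarithm is used for `OrgSize`, the sketch below lists the lower bound of each bucket from the table above next to its natural log. The `lower_bounds` list is hand-copied for this example and is not part of the assignment code.

```python
import math

# Lower bounds of the OrgSize buckets listed in the table above (illustrative copy).
lower_bounds = [1, 2, 10, 20, 100, 500, 1000, 5000, 10000]
logs = [math.log(a) for a in lower_bounds]

for a, v in zip(lower_bounds, logs):
    print(f"{a:>6} -> {v:.3f}")
```

The raw lower bounds span four orders of magnitude, while their logs rise gradually from 0 to about 9.2.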

All your conversion functions must raise an exception if they encounter an unexpected value. As an example, we give you the boolean conversion function:

In [4]:
def type_boolean_test(type_boolean):
    test.true(isinstance(type_boolean("y"), float))
    test.equal(type_boolean("y"), 1.0)
    test.equal(type_boolean("n"), 0.0)
    test.exception(lambda: type_boolean("5"))

@test
def type_boolean(c):
    if c == "y": return 1.0
    elif c == "n": return 0.0
    elif c == "NA": return 0.0
    raise ValueError(c)

### TESTING type_boolean: PASSED 4/4
###



Now fill in these functions according to specification:

In [5]:
# Integer
def type_integer_test(type_integer):
    test.true(isinstance(type_integer("5"), float))
    test.equal(type_integer("3"), 3.0)
    test.equal(type_integer("0"), 0.0)
    test.equal(type_integer("-4"), -4.0)
    test.equal(type_integer("NA"), 0.0)
    test.exception(lambda: type_integer("yes"))

@test
def type_integer(c):
    if c == "NA": return 0.0
    return float(c)  # raises ValueError for non-numeric strings such as "yes"

### TESTING type_integer: PASSED 6/6
###



In [6]:
# CareerSat
def type_CareerSat_test(type_CareerSat):
    test.true(isinstance(type_CareerSat("vd"), float))
    test.equal(type_CareerSat("sd"), -1.0)
    test.equal(type_CareerSat("ne"), 0.0)
    test.equal(type_CareerSat("NA"), 0.0)
    test.equal(type_CareerSat("ss"), 1.0)
    test.equal(type_CareerSat("vs"), 2.0)
    test.exception(lambda: type_CareerSat("yes"))

@test
def type_CareerSat(c):
    if c == "vd": return -2.0
    elif c == "sd": return -1.0
    elif c == "ne": return 0.0
    elif c == "NA": return 0.0
    elif c == "ss": return 1.0
    elif c == "vs": return 2.0
    raise ValueError(c)

### TESTING type_CareerSat: PASSED 7/7
###



In [14]:
# EdLevel
def type_EdLevel_test(type_EdLevel):
    test.true(isinstance(type_EdLevel("other"), float))
    test.equal(type_EdLevel("bachelors"), 1.0)
    test.equal(type_EdLevel("masters"), 1.5)
    test.equal(type_EdLevel("doctoral"), 2.0)
    test.exception(lambda: type_EdLevel("yes"))

@test
def type_EdLevel(c):
    if c == "other": return 0.0
    elif c == "bachelors": return 1.0
    elif c == "masters": return 1.5
    elif c == "doctoral": return 2.0
    raise ValueError(c)


### TESTING type_EdLevel: PASSED 5/5
###



In [11]:
# MgrIdiot
def type_MgrIdiot_test(type_MgrIdiot):
    test.true(isinstance(type_MgrIdiot("NA"), float))
    test.equal(type_MgrIdiot("not"), -1.0)
    test.equal(type_MgrIdiot("some"), 0.0)
    test.equal(type_MgrIdiot("very"), 1.0)
    test.exception(lambda: type_MgrIdiot("yes"))

@test
def type_MgrIdiot(c):
    if c == "NA": return -1.0
    elif c == "not": return -1.0
    elif c == "some": return 0.0
    elif c == "very": return 1.0
    raise ValueError(c)

### TESTING type_MgrIdiot: PASSED 5/5
###



In [10]:
# OpenSourcer
def type_OpenSourcer_test(type_OpenSourcer):
    test.true(isinstance(type_OpenSourcer("never"), float))
    test.equal(type_OpenSourcer("year"), 0.5)
    test.equal(type_OpenSourcer("month-year"), 1.0)
    test.equal(type_OpenSourcer("month"), 2.0)
    test.exception(lambda: type_OpenSourcer("yes"))

@test
def type_OpenSourcer(c):
    if c == "never": return 0.0
    elif c == "year": return 0.5
    elif c == "month-year": return 1.0
    elif c == "month": return 2.0
    raise ValueError(c)

### TESTING type_OpenSourcer: PASSED 5/5
###



In [12]:
# OrgSize
def type_OrgSize_test(type_OrgSize):
    test.true(isinstance(type_OrgSize("1"), float))
    test.equal(type_OrgSize("NA"), 0)
    test.equal(type_OrgSize("2-9"), 0.6931471805599453)
    test.equal(type_OrgSize("100-499"), 4.605170185988092)
    test.equal(type_OrgSize("10,000+"), 9.210340371976184)
    test.exception(lambda: type_OrgSize("yes"))

@test
def type_OrgSize(c):
    if c == "NA": return 0.0  # ln(1.0)
    # Take the lower bound a of "a", "a-b", or "a+", ignoring thousands separators.
    lower = c.replace(",", "").rstrip("+").split("-")[0]
    return math.log(int(lower))  # raises ValueError on unexpected input

### TESTING type_OrgSize: PASSED 6/6
###



Now we deal with `OpSys`: from the one column in the source, create three columns (called `OpSysWin`, `OpSysMac`, and `OpSysTux`, corresponding to the values `win`, `mac`, and `tux`). For each row, at most one of these cells must be 1.0, and the others must be 0.0. If the value in the source cell is `NA`, BSD, or something else, then all three cells must be 0.0.

This is called a one-hot encoding and is a common way to handle categorical variables.

In `convert_data_stackoverflow`, you should:

 - Encode `OpSys` as the one-hot encoding discussed above.
 - Remove `Respondent` and `Country`, the two columns not used.
 - Convert the other columns using the appropriate functions above.
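
A minimal sketch of the one-hot idea for a single value, assuming the three categories named above. The helper name `one_hot` is made up for this illustration and is not required by the assignment.

```python
def one_hot(value, categories=("win", "mac", "tux")):
    """Return one float per category: 1.0 where `value` matches, 0.0 elsewhere."""
    return [1.0 if value == c else 0.0 for c in categories]

print(one_hot("mac"))  # [0.0, 1.0, 0.0]
print(one_hot("NA"))   # [0.0, 0.0, 0.0]
```

Any value outside the known categories simply produces all zeros, which is the behaviour required for `NA` and other operating systems.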

In [15]:
def convert_data_stackoverflow_test(convert_data_stackoverflow):
    headers, rows = convert_data_stackoverflow(*read_csv())
    # Correct number of rows:
    test.equal(len(rows), 65679)

    # If this test fails, your headers are incorrect:
    test.equal(set(headers), {'CareerSat', 'MgrWant', 'Age', 'CodeRevHrs', 'ConvertedComp', 'Dependents', 'DevEnvironVSC', 'DevTypeFullStack', 'EdLevel', 'EduOtherMOOC', 'EduOtherSelf', 'Extraversion', 'GenderIsMan', 'Hobbyist', 'MgrIdiot', 'OpSysWin', 'OpSysMac', 'OpSysTux', 'OpenSourcer', 'OrgSize', 'Student', 'UndergradMajorIsComputerScience', 'UnitTestsProcess', 'WorkWeekHrs', 'YearsCode', 'YearsCodePro'})
    # Type check:
    test.true(all(all(isinstance(v, float) for v in r) for r in rows))
    # Operating System columns:
    for row in rows:
        d = dict(zip(headers, row))
        if sorted([d["OpSysWin"], d["OpSysMac"], d["OpSysTux"]]) not in [[.0, .0, 1.], [0.]*3]:
            test.true(False)
            break
    else:
        test.true("There is correctly at most one OpSys* column set to 1.0")
    
    # More direct tests
    test.equal(dict(zip(headers, rows[-2])), {'CareerSat': -1.0, 'MgrWant': 1.0, 'Age': 0.0, 'CodeRevHrs': 5.0, 'ConvertedComp': 588012.0, 'Dependents': 1.0, 'DevEnvironVSC': 1.0, 'DevTypeFullStack': 1.0, 'EdLevel': 1.5, 'EduOtherMOOC': 0.0, 'EduOtherSelf': 0.0, 'Extraversion': 0.0, 'GenderIsMan': 1.0, 'Hobbyist': 1.0, 'MgrIdiot': -1.0, 'OpSysWin': 0.0, 'OpSysMac': 0.0, 'OpSysTux': 1.0, 'OpenSourcer': 0.0, 'OrgSize': 4.605170185988092, 'Student': 1.0, 'UndergradMajorIsComputerScience': 1.0, 'UnitTestsProcess': 1.0, 'WorkWeekHrs': 40.0, 'YearsCode': 10.0, 'YearsCodePro': 8.0})
    test.equal(dict(zip(headers, rows[-1])), {'CareerSat': -1.0, 'MgrWant': 0.0, 'Age': 33.0, 'CodeRevHrs': 0.0, 'ConvertedComp': 22915.0, 'Dependents': 0.0, 'DevEnvironVSC': 0.0, 'DevTypeFullStack': 0.0, 'EdLevel': 1.0, 'EduOtherMOOC': 0.0, 'EduOtherSelf': 0.0, 'Extraversion': 1.0, 'GenderIsMan': 1.0, 'Hobbyist': 0.0, 'MgrIdiot': -1.0, 'OpSysWin': 0.0, 'OpSysMac': 0.0, 'OpSysTux': 1.0, 'OpenSourcer': 2.0, 'OrgSize': 2.995732273553991, 'Student': 0.0, 'UndergradMajorIsComputerScience': 1.0, 'UnitTestsProcess': 1.0, 'WorkWeekHrs': 48.0, 'YearsCode': 9.0, 'YearsCodePro': 5.0})

@test
def convert_data_stackoverflow(headers, data):
    """convert the raw CSV strings into floats suitable for regression.
    
    args:
        headers : List[str] -- the header for each column in the CSV
        data : List[Tuple[str]] -- the CSV data, where each inner tuple corresponds to a row in the CSV file.
 
    returns: Tuple[headers, body] where
      headers : List[str] -- the new headers, dropping the Country and Respondent headers and expanding OpSys into OpSysWin, OpSysMac, and OpSysTux
      body : List[List[float]] -- the converted CSV body
    """
    # Converter for each column kept in the output; Respondent and Country are
    # dropped, and OpSys is handled separately as a one-hot encoding.
    converters = {
        'CareerSat': type_CareerSat,
        'EdLevel': type_EdLevel,
        'MgrIdiot': type_MgrIdiot,
        'OpenSourcer': type_OpenSourcer,
        'OrgSize': type_OrgSize,
    }
    for n in ('Age', 'CodeRevHrs', 'ConvertedComp', 'WorkWeekHrs', 'YearsCode', 'YearsCodePro'):
        converters[n] = type_integer
    for n in ('MgrWant', 'Dependents', 'DevEnvironVSC', 'DevTypeFullStack', 'EduOtherMOOC',
              'EduOtherSelf', 'Extraversion', 'GenderIsMan', 'Hobbyist', 'Student',
              'UndergradMajorIsComputerScience', 'UnitTestsProcess'):
        converters[n] = type_boolean

    # Converted columns keep their original order; the one-hot columns go last.
    kept = [h for h in headers if h in converters]
    os_values = ('win', 'mac', 'tux')
    head = kept + ['OpSysWin', 'OpSysMac', 'OpSysTux']

    value = []
    for row in data:
        e = dict(zip(headers, row))
        converted = [converters[h](e[h]) for h in kept]
        # One-hot encoding: any other OpSys value (e.g. NA) leaves all three at 0.0.
        converted += [1.0 if e['OpSys'] == v else 0.0 for v in os_values]
        value.append(converted)

    return head, value

### TESTING convert_data_stackoverflow: PASSED 6/6
###



### Splitting Data

Now we prepare the converted data for regression. In this step, we:

1. split the data into training and validation sets,
2. convert each set to a NumPy `ndarray` with underlying type `np.float32`, and
3. split each set into the predicted columns and the feature columns.

We will save the first 20% of the dataset (rounded down) as the validation set and keep the remaining as the training set. (Note that it is common practice to randomize the dataset; this has already been done. Don't shuffle the dataset for this assignment.)

Ensure that the underlying type of the ndarray is np.float32, not the default np.float64. We do not need the added precision of 64-bit floating point numbers for this problem, and using the smaller numbers will speed up computation and reduce the amount of memory we need.
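
To make the split and the dtype choice concrete, here is a small sketch on made-up data. The `rows` list and variable names are assumptions for illustration only; the real function works on the converted survey rows.

```python
import numpy as np

rows = [[float(i), float(i) * 2.0] for i in range(11)]  # 11 made-up rows, 2 columns

n_val = len(rows) // 5  # 20% of 11, rounded down -> 2
val = np.array(rows[:n_val], dtype=np.float32)
train = np.array(rows[n_val:], dtype=np.float32)

print(val.shape, train.shape)                          # (2, 2) (9, 2)
print(train.nbytes, train.astype(np.float64).nbytes)   # 72 144
```

The same array stored as `np.float64` takes twice as many bytes as the `np.float32` version.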

In [16]:
def split_data_test(split_data):
    headers, rows = convert_data_stackoverflow(*read_csv())
    l = len(rows)
    
    val, train = split_data(rows)
    test.equal(len(val), l // 5)
    test.true(isinstance(val, np.ndarray))
    test.equal(val.dtype, np.float32)
    test.equal(len(train), l - (l // 5))
    test.true(isinstance(train, np.ndarray))
    test.equal(train.dtype, np.float32)

@test
def split_data(data):
    """split the data into training and validation sets, and convert them to np.ndarray. (Step 1 and 2 above.)

    args:
        data : List[List[str]] -- the CSV data, where each inner list corresponds to a row in the CSV file.

    returns: Tuple[val, train] where
      val  : np.ndarray[num_val_rows, num_features] -- the first 20% of the dataset (rounded down)
      train : np.ndarray[num_train_rows, num_features] -- the remaining rows from data
    
    Ensure that the underlying type of the output is np.float32, not the default np.float64.
    """
    l = len(data) // 5  # size of the validation set: the first 20%, rounded down
    val = np.array(data[:l], dtype = np.float32)
    train = np.array(data[l:], dtype = np.float32)
    
    return val, train

### TESTING split_data: PASSED 6/6
###



In [17]:
def separate_objective_test(separate_objective):
    headers, rows = convert_data_stackoverflow(*read_csv())
    val, train = split_data(rows)

    for subset in [val, train]:
        subset_headers, subset_features, subset_objectives = separate_objective(headers, subset, ["CareerSat", "MgrWant"])

        test.true(isinstance(subset_objectives, tuple))
        test.equal(len(subset_objectives), 2)
        test.true("CareerSat" not in subset_headers)
        test.true("MgrWant" not in subset_headers)
        test.equal(subset_features.shape[1], 24)

@test
def separate_objective(headers, data, objectives):
    """split the objective columns from the headers and data. (Step 3 above.)

    args:
        headers    : List[str] -- the headers for the data, used to find the objective columns from the data array
        data       : np.ndarray[num_rows, num_columns] -- the data
        objectives : the columns to extract from the data

    returns: Tuple[o_headers, o_features, o_objectives] where
      o_headers  : List[str] -- a list of headers without the objective columns
      o_features : np.ndarray[num_train_rows, num_features] -- the remaining columns from data. (num_features = num_columns - len(objectives))
      o_objectives : Tuple[np.ndarray[num_train_rows], ...] -- a list of objective columns from the data, each element is a 1-dimensional np.ndarray corresponding to the entry in objectives.
     """
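    data = np.asarray(data)  # also accepts a list of rows
    # Column positions of the features (everything except the objectives), in header order.
    keep = [i for i, h in enumerate(headers) if h not in objectives]
    o_headers = [headers[i] for i in keep]
    o_features = data[:, keep]
    
    # One 1-dimensional array per objective, in the order given by `objectives`.
    o_objectives = tuple(data[:, headers.index(o)] for o in objectives)
    
    return o_headers, o_features, o_objectives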
        

### TESTING separate_objective: PASSED 10/10
###



### Linear Regression

Now you will finally implement a linear regression. As a reminder, linear regression models the data as

$$\mathbf y = \mathbf X\mathbf \beta + \mathbf \epsilon$$

where $\mathbf y$ is a vector of outputs, $\mathbf X$ is the matrix of features (also known as the design matrix), $\mathbf \beta$ is a vector of parameters, and $\mathbf \epsilon$ is noise. We will be estimating $\mathbf \beta$ using the [Ordinary Least Squares](https://en.wikipedia.org/wiki/Ordinary_least_squares) approach.

Hints:

 1. Use `np.linalg.solve` to calculate `beta` instead of inverting the matrix, which is [numerically unstable](https://math.stackexchange.com/a/1622654).
 2. You should add `1e-4*np.eye(...)` to the coefficient matrix to prevent singular value errors. Our test cases assume the coefficient `1e-4`, which is **not** equal to `np.exp(-4)`.
 3. Do not include a bias term/constant column.
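
Taken together, the hints above amount to solving the regularized normal equations $(\mathbf X^\top \mathbf X + \lambda \mathbf I)\,\mathbf \beta = \mathbf X^\top \mathbf y$ with $\lambda = 10^{-4}$. The sketch below shows one way to do that; `ridge_solve` and the toy arrays are illustrative names, not part of the assignment's required interface.

```python
import numpy as np

def ridge_solve(X, y, lam=1e-4):
    """Solve (X^T X + lam * I) beta = X^T y without forming a matrix inverse."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# An all-zero X makes X^T X singular; the lam * I term keeps the system solvable.
beta_zero = ridge_solve(np.zeros((20, 5)), np.ones(20))

# Here y equals the second column of X, so beta should be close to [0, 1].
X = np.array([[0., 1.], [1., 2.], [2., 3.]])
y = np.array([1., 2., 3.])
beta = ridge_solve(X, y)
print(beta_zero, beta)
```

Note that the smoothing term is added to $\mathbf X^\top \mathbf X$, not to $\mathbf X$ itself.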

In [18]:
class LinearRegression():
    """ Perform linear regression and predict the output on unseen examples. 
    
    attributes: 
        beta (np.ndarray) : vector containing parameters for the features
    """

    def train(self, X, y):
        """ Train the linear regression model by computing the estimate of the parameters
        You should store the model parameters in self.beta, overwriting parameters as necessary.

        args: 
            X (np.ndarray[num_examples, num_columns]) : matrix of training data
            y (np.ndarray[num_examples]) : vector of output variables

        return: LinearRegression -- returns itself (for convenience)
        """
        # Regularized normal equations: (X^T X + 1e-4 * I) beta = X^T y.
        # Solve the linear system directly instead of inverting the matrix.
        n = X.shape[1]
        self.beta = np.linalg.solve(X.T @ X + 1e-4 * np.eye(n), X.T @ y)
        return self

    def predict(self, X_p): 
        """ Use the learned model to predict the output of X_p

        args: 
            X_p (np.ndarray[num_examples, num_columns]) matrix of test/validation data where each row corresponds to an example

        return: 
            (np.ndarray[num_examples]) vector of predicted outputs
        """
        return X_p @ self.beta  # matrix-vector product, same as X_p.dot(self.beta)
        

In [19]:
def linear_regression_instance_test(linear_regression_instance):
    lr = linear_regression_instance()

    # If this throws a Singular Matrix error, you did not add the smoothing term:
    # If you get a reference value of [10000.0, 10000.0, 10000.0, 10000.0, 10000.0], then you are applying
    #   smoothing incorrectly. You should not be adding the smoothing term to X.
    test.equal(lr.train(np.zeros((20, 5)), np.ones((20,))).beta.tolist(), [0.0]*5)

    # Basic functionality tests:
    test.equal(lr.train(np.eye(6)*(1-1e-4), np.ones((6,))).beta.round(4).tolist(), [1.0]*6)
    test.equal(lr.train(np.array([[0., 1.], [1., 2.], [2., 3.]]), np.array([1., 2., 3.])).beta.round(4).tolist(), [0.0001, 0.9999])
    
# Don't remove this function; we use it for the auto-grader.
@test
def linear_regression_instance():
    return LinearRegression()

### TESTING linear_regression_instance: PASSED 3/3
###



### Error Functions

One last part to this: linear regression minimizes the mean-squared error. Write a function that calculates the mean-squared error when given a prediction and a ground-truth vector.
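
For reference, with predictions $\hat y_i$ and ground-truth values $y_i$ over $n$ examples (notation introduced here for illustration), the quantity to compute is

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left(\hat y_i - y_i\right)^2$$

For instance, predicting all ones against a ground truth of all zeros gives an MSE of 1.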

In [20]:
def mean_squared_error_test(mean_squared_error):
    test.equal(mean_squared_error(np.ones(10), np.ones(10)), 0)
    test.equal(mean_squared_error(np.ones(10), np.zeros(10)), 1)

@test
def mean_squared_error(pred, ground_truth):
    """ calculate the mean-squared error between pred and ground_truth
    
    args:
      pred : np.ndarray[num_examples] -- the predictions
      ground_truth : np.ndarray[num_examples] -- the ground truth values
      
    returns: float -- the mean-squared error between predictions and ground_truth values.
    """
    mse = ((pred-ground_truth)**2).mean()
    return mse
    
    

### TESTING mean_squared_error: PASSED 2/2
###



### Putting it all together

And finally, let's run the entire pipeline end-to-end. You should put all the functions you have written so far together to:

1. read and split the dataset,
2. train two separate models on the training set, one to predict `MgrWant` and the other to predict `CareerSat`,
3. perform inference on the validation set, and
4. return the mean-squared error for each.

Remember to exclude both `MgrWant` and `CareerSat` from the features when training either model (i.e. when predicting `MgrWant`, you should not use `CareerSat` as a feature, and vice versa).

In [21]:
def linear_regression_run_test(linear_regression_run):
    mse_mgr, mse_sat = linear_regression_run(*read_csv())
    test.true(np.abs(mse_mgr - 0.07214) < 1e-4)
    test.true(np.abs(mse_sat - 1.29104) < 1e-4)

@test
def linear_regression_run(headers, rows):
    """ Perform linear regression on (headers, rows), and return the MSE on the validation set for both `MgrWant` and `CareerSat`. 

    args: 
        headers : List[str] -- headers from CSV file
        rows : List[Tuple[str,...]] -- data from the CSV file
        
    return: Tuple[MSEMgrWant, MSECareerSat], where
        MSEMgrWant : float -- the MSE between the predictions and the ground truth values for the column `MgrWant`.
        MSECareerSat : float -- the MSE between the predictions and the ground truth values for the column `CareerSat`.
    """
    headers, body = convert_data_stackoverflow(headers, rows)
    
    objectives = ['MgrWant', 'CareerSat']
    o_headers, o_features, o_objectives = separate_objective(headers, body, objectives)
    
    val, train = split_data(o_features)
    m,n = val.shape
    
    
    lr_mgr = LinearRegression()
    lr_mgr.train(train, o_objectives[0][m:])
    mgr_pred = lr_mgr.predict(val)
    
    lr_career = LinearRegression()
    lr_career.train(train, o_objectives[1][m:])
    career_pred = lr_career.predict(val)
    
    MSEMgr = mean_squared_error(mgr_pred, o_objectives[0][:m])
    MSECareer = mean_squared_error(career_pred, o_objectives[1][:m])
    
    return MSEMgr, MSECareer

### TESTING linear_regression_run: PASSED 2/2
###



## BONUS SECTION - Your Turn

Now that we've walked through this once with a large dataset, it is your turn to do this. You will be using [FiveThirtyEight's The Ultimate Halloween Candy Power Ranking](https://www.kaggle.com/fivethirtyeight/the-ultimate-halloween-candy-power-ranking/), included (and shuffled) as `candy.csv.gz`. (The dataset is Copyright (c) 2014 ESPN Internet Ventures. Our shuffled version is released under the MIT License.)

From the original documentation, here is a description of the columns:

| Column | Description | type |
| --- |:--- |:--- |
| **`winpercent`** | The overall win percentage according to 269,000 matchups. | float |
| `competitorname` | The bar name | string (don't use this) |
| `chocolate` | Does it contain chocolate? | boolean (`y`, `n`) |
| `fruity` | Is it fruit flavored? | boolean (`y`, `n`) |
| `caramel` | Is there caramel in the candy? | boolean (`y`, `n`) |
| `peanutyalmondy` | Does it contain peanuts, peanut butter or almonds? | boolean (`y`, `n`) |
| `nougat` | Does it contain nougat? | boolean (`y`, `n`) |
| `crispedricewafer` | Does it contain crisped rice, wafers, or a cookie component? | boolean (`y`, `n`) |
| `hard` | Is it a hard candy? | boolean (`y`, `n`) |
| `bar` | Is it a candy bar? | boolean (`y`, `n`) |
| `pluribus` | Is it one of many candies in a bag or box? | boolean (`y`, `n`) |
| `sugarpercent` | The percentile of sugar it falls under within the data set. | float |
| `pricepercent` | The unit price percentile compared to the rest of the set. | float |


You must predict `winpercent` using exactly **four** other columns. Use the first 20% of the dataset as the validation set (the dataset has already been shuffled for you). As output, you should provide the names of the columns and the MSE on the validation set. Your MSE must be no more than `330`.

You should convert boolean columns using `type_boolean` and use `LinearRegression` to perform the regression. Don't implement a bias term/constant column. Feel free to create new helper functions as necessary.

In [29]:
import pandas as pd
def candy_test(candy):
    headers, mse = candy(*read_csv("candy.csv.gz"))
    #print((headers, mse))
    test.true(len(headers) == 4)
    test.true("winpercent" not in headers)
    test.true(mse < 330.)
    
@test
def candy(headers, data):
    """ predict winpercent using exactly four other columns
    
    args:
        headers : List[str] -- headers read from the csv file
        data : List[List[str]] -- data from the csv file

    returns: Tuple[selected_headers, mse]
        selected_headers : List[str] -- the headers of the four columns used to train the model
        mse : float -- the mean-squared error when the columns in selected_headers are used to predict `winpercent`
    """
    df_main = pd.DataFrame(data, columns = headers)
    # Keep four feature columns (chocolate, fruity, sugarpercent, pricepercent) plus winpercent.
    # `columns=` is used because newer pandas versions reject a positional axis argument.
    df_main = df_main.drop(columns=['competitorname','crispedricewafer', 'caramel','hard','bar','pluribus','peanutyalmondy','nougat'])
    for col in ('chocolate', 'fruity'):
        df_main[col] = [type_boolean(v) for v in df_main[col].values]
    
    headers = list(df_main.columns)
    body  = df_main.values
    
    objectives = ['winpercent']
    o_headers, o_features, o_objectives = separate_objective(headers, body, objectives)
    
    o_objectives = np.array(o_objectives, dtype='float32')
    
    val, train = split_data(o_features)
    m,n = val.shape
    
    lr_win = LinearRegression()
    lr_win.train(train, o_objectives[0][m:])
    win_pred = lr_win.predict(val)
    
    MSEwin = mean_squared_error(win_pred, o_objectives[0][:m])
    
    return o_headers, MSEwin

### TESTING candy: PASSED 3/3
###

