# DSC 80: Lab 01

### Due Date: Monday October 7, Noon (12:00 PM)

## Instructions
Much like in DSC 10, this Jupyter Notebook contains the statements of the problems and provides code and markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding work will be developed in an accompanying `lab01.py` file that is imported into the current notebook.

Labs and programming assignments will be graded in (at most) two ways:
1. The functions and classes in the accompanying python file will be tested (a la DSC 20),
2. The notebook will be graded (for graphs and free response questions).

**Do not change the function names in the `*.py` file**
- The functions in the `*.py` file are how your assignment is graded, and they are graded by their name. The dictionary at the end of the file (`GRADED FUNCTIONS`) contains the "grading list". The final function in the file allows your doctests to check that all the necessary functions exist.
- If you changed something you weren't supposed to, just use git to revert!

**Tips for working in the Notebook**:
- The notebooks present the questions and give you a place to record your results for later review.
- The notebooks for *lab assignments* are not graded (only the `.py` file).
- Notebooks for PAs will serve as a final report for the assignment, and contain conclusions and answers to open ended questions that are graded.
- The notebook serves as a nice environment for 'pre-development' and experimentation before designing your function in your `.py` file.

**Tips for developing in the .py file**:
- Do not change the function names in the starter code; grading is done using these function names.
- Do not change the docstrings in the functions. These are there to tell you if your work is on the right track!
- You are encouraged to write your own additional functions to solve the lab! 
    - Developing in python usually consists of larger files, with many short functions.
    - You may write your other functions in an additional `.py` file that you import in `lab01.py` (much like we do in the notebook).
- Always document your code!

### Importing code from `lab**.py`

* We import our `.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab**.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab**.py` in the notebook.
    - `autoreload` is necessary because, upon first import, the module is loaded once and cached by the running kernel (its source is also compiled to bytecode in the directory `__pycache__`). Subsequent imports of `lab**` merely return the already-loaded module instead of re-reading the file.
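
The caching behavior can be seen in a small sketch: a repeated `import` returns the module already loaded in the running interpreter, while `importlib.reload` re-executes the source. The module name `demo_lab` is hypothetical and is written to a temporary directory only for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip writing .pyc files so the demo never sees stale bytecode.
sys.dont_write_bytecode = True

with tempfile.TemporaryDirectory() as tmp:
    module_path = Path(tmp) / "demo_lab.py"  # hypothetical throwaway module
    module_path.write_text("VALUE = 1\n")
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()

    import demo_lab
    first = demo_lab.VALUE

    # Edit the file on disk, then import again.
    module_path.write_text("VALUE = 2  # edited\n")
    import demo_lab  # no effect: the cached module is reused
    after_reimport = demo_lab.VALUE

    # An explicit reload re-runs the source file.
    importlib.reload(demo_lab)
    after_reload = demo_lab.VALUE

    # Clean up the interpreter state touched by the demo.
    sys.path.remove(tmp)
    del sys.modules["demo_lab"]

print(first, after_reimport, after_reload)  # 1 1 2
```

The `autoreload` extension automates the reload step before each cell runs, which is why edits to the `.py` file show up without restarting the kernel.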

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import lab01 as lab

In [3]:
import os
import pandas as pd
import numpy as np

## Python Basics

---
**Question 0 (EXAMPLE):**

Write a function that takes in a possibly empty list of integers and:
* Returns `True` if there exist two adjacent list elements that are consecutive integers.
* Otherwise, returns `False`.

For example, because `9` is next to `8`:
```
>>> lab.consecutive_ints([5,3,6,4,9,8])
True
```
Whereas:
```
>>> lab.consecutive_ints([1,3,5,7,9])
False
```

*Note*: This question is done for you, to demonstrate a completed homework problem.
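
One possible implementation is sketched below; it is only an illustration, and the version provided in `lab01.py` may differ. It compares each element with its right-hand neighbor.

```python
def consecutive_ints(ints):
    """Return True if two adjacent elements differ by exactly 1."""
    # zip pairs each element with its successor; an empty or
    # single-element list yields no pairs, so any() returns False.
    return any(abs(a - b) == 1 for a, b in zip(ints, ints[1:]))


print(consecutive_ints([5, 3, 6, 4, 9, 8]))  # True (9 is next to 8)
print(consecutive_ints([1, 3, 5, 7, 9]))     # False
print(consecutive_ints([]))                  # False
```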

In [4]:
# Develop your code here (or in an IDE) if you'd like.
# Though only code in lab01.py will be graded!

In [5]:
# Add more cells if you'd like!

Test your code in two ways:
1. Run the cell below to test your code. You should also copy the cell and change the input to test further (i.e. write your own doctests)! Does it work for corner cases? Real-world data is **very messy** and you should expect your data processing code to break without thorough testing!
2. Run doctests on `lab01.py` by running the following command on the command line:
```
python -m doctest lab01.py
```
If the doctests pass, then there should be *no* output.
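
As a rough sketch of how doctests work, the example below defines a hypothetical helper `double` (not part of the lab) with two doctests in its docstring and runs them programmatically. The `doctest` calls are from the standard library; everything else is made up for illustration.

```python
import doctest


def double(x):
    """
    Hypothetical helper used only to illustrate doctests.

    >>> double(2)
    4
    >>> double(-1.5)
    -3.0
    """
    return 2 * x


# Parse the docstring into a test and run it; each `>>>` line is
# executed and its output is compared with the line that follows.
parser = doctest.DocTestParser()
test = parser.get_doctest(double.__doc__, {"double": double}, "double", None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(results.failed, results.attempted)  # 0 2
```

Adding your own `>>>` lines for corner cases (empty input, a single element, negative numbers) extends the checks that `python -m doctest` performs.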

In [6]:
# test your code!
lab.consecutive_ints([1,3,2,4])

True

In [7]:
lab.consecutive_ints([0])

False

In [8]:
lab.consecutive_ints([])

False

---
**Question 1 (median):**

Write a function called *median* that takes a non-empty list of numbers, returning the median element of the list. If the list has even length, it should return the mean of the two elements in the middle. Do not use any imported libraries for this question; you may use any built-in function.


In [9]:
def median(nums):
    """
    median takes a non-empty list of numbers,
    returning the median element of the list.
    If the list has even length, it should return
    the mean of the two elements in the middle.

    :param nums: a non-empty list of numbers.
    :returns: the median of the list.
    
    :Example:
    >>> median([6, 5, 4, 3, 2]) == 4
    True
    >>> median([50, 20, 15, 40]) == 30
    True
    >>> median([1, 2, 3, 4]) == 2.5
    True
    """
    sortedNums = sorted(nums)
    if len(sortedNums) % 2 == 0:
        return (sortedNums[len(sortedNums) // 2] + sortedNums[(len(sortedNums) - 1) // 2]) / 2
    return sortedNums[(len(sortedNums) - 1) // 2]
        


In [10]:
# Try this
lab.median([0, -1, 1, 100])

Ellipsis

---
**Question 2 (List Distances):**

Similar to Question 0, write a function that takes in a possibly empty list of integers and:
* Returns `True` if there exist two list elements $i$ places apart, whose distance as integers is also $i$.
* Otherwise, returns `False`.

Assume your inputs tend to satisfy the condition, and the pair(s) satisfying the condition tend to be close together; design your function to run faster for this case. (Optimizing your code for an assumed distribution of incoming data is very common in data science.)

For example, because `3` and (the second) `5` are two places apart, and $|3-5| = 2$:
```
>>> lab.same_diff_ints([5,3,1,5,9,8])
True
```
Whereas:
```
>>> lab.same_diff_ints([1,3,5,7,9])
False
```

*Note*: Make sure to define some extreme test cases. Use the `%time` command to time your function!
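
Outside the notebook, the standard-library `timeit` module gives comparable measurements. The sketch below times one possible implementation on two made-up extreme inputs: one where a qualifying pair sits at the very start, and one where no pair qualifies and every pair must be checked.

```python
import timeit


def same_diff_ints(ints):
    # Check nearby pairs first: distance 1, then 2, and so on.
    for distance in range(1, len(ints)):
        for i in range(len(ints) - distance):
            if abs(ints[i] - ints[i + distance]) == distance:
                return True
    return False


# Extreme inputs (made up for illustration).
early_hit = [0, 1] + [0] * 300  # qualifying pair at the very start
no_hit = [0] * 300              # all differences are 0, so no pair qualifies

for name, data in [("early_hit", early_hit), ("no_hit", no_hit)]:
    seconds = timeit.timeit(lambda: same_diff_ints(data), number=5)
    print(f"{name}: result={same_diff_ints(data)}, {seconds:.6f}s for 5 runs")
```

The second input should take noticeably longer, since the function only stops early when a pair is found.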

In [11]:
def same_diff_ints(ints):
    """
    same_diff_ints tests whether a list contains
    two list elements i places apart, whose distance
    as integers is also i.

    :param ints: a list of integers
    :returns: a boolean value if ints contains two
    elements as described above.

    :Example:
    >>> same_diff_ints([5,3,1,5,9,8])
    True
    >>> same_diff_ints([1,3,5,7,9])
    False
    """
    if len(ints) < 2:
        return False
    # Concept: qualifying pairs tend to be close together, so check all pairs
    # at distance 1 first, then distance 2, and so on.
    for distance in range(1, len(ints)):
        for i in range(len(ints) - distance):
            if abs(ints[i] - ints[i + distance]) == distance:
                return True
    return False



In [12]:
%time lab.same_diff_ints([5,3,1,5,9,8])

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 5.96 µs


Ellipsis

---
## Strings and Files

The following questions will help you (re)learn the basics of working with strings and reading data from files (which are read in as strings, by default).

---
**Question 3 (Prefixes):**

Write a function `prefixes` that takes a string and returns a string of every consecutive prefix of the input string. For example, `prefixes('Data!')` should return `'DDaDatDataData!'`.  (See the doctests for more examples).

Recall that [strings may be sliced](https://docs.python.org/3/tutorial/introduction.html#strings), like lists.
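
A minimal sketch of slicing, using the example string from the question: `s[:k]` is the prefix made of the first `k` characters, and joining the prefixes of every length gives the target string.

```python
s = "Data!"

# s[:k] is the prefix made of the first k characters.
print(s[:1])   # D
print(s[:3])   # Dat
print(s[:-1])  # Data  (everything except the last character)

# Joining the prefixes of lengths 1..len(s) gives the target string.
joined = "".join(s[:k] for k in range(1, len(s) + 1))
print(joined)  # DDaDatDataData!
```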


In [13]:
def prefixes(s):
    """
    prefixes returns a string of every 
    consecutive prefix of the input string.

    :param s: a string.
    :returns: a string of every consecutive prefix of s.

    :Example:
    >>> prefixes('Data!')
    'DDaDatDataData!'
    >>> prefixes('Marina')
    'MMaMarMariMarinMarina'
    >>> prefixes('aaron')
    'aaaaaraaroaaron'
    """
    if len(s) == 0:
        return ""
    return prefixes(s[:-1]) + s



---
**Question 4 (Evens reversed):**

Write a function `evens_reversed` that takes in a non-negative integer $N$ and returns a string containing all even integers from $1$ to $N$ (inclusive) in reversed order, separated by spaces. Additionally, [zero pad](https://www.tutorialspoint.com/python/string_zfill.htm) each integer, so that each has the same length.
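
As a small sketch of zero padding, `str.zfill(width)` pads a string with leading zeros up to `width` characters. The variable names below are illustrative.

```python
N = 10
width = len(str(N))  # pad every number to the width of the largest one

print(str(8).zfill(width))   # 08
print(str(10).zfill(width))  # 10

# range(start, stop, step) with a negative step counts down over the evens.
start = N - N % 2  # largest even number <= N
padded = " ".join(str(k).zfill(width) for k in range(start, 0, -2))
print(padded)  # 10 08 06 04 02
```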

In [20]:
def evens_reversed(N):
    """
    evens_reversed returns a string containing 
    all even integers from  1  to  N  (inclusive)
    in reversed order, separated by spaces. 
    Each integer is zero padded.

    :param N: a non-negative integer.
    :returns: a string containing all even integers 
    from 1 to N reversed, formatted as described above.

    :Example:
    >>> evens_reversed(7)
    '6 4 2'
    >>> evens_reversed(10)
    '10 08 06 04 02'
    """
    if N % 2 == 1:
        N -= 1
    digits = len(str(N))
    s = ""
    while N != 0:
        s += " " + str(N).zfill(digits)
        N -= 2
    return s[1:]

In [21]:
evens_reversed(10)

'10 08 06 04 02'

---

[Recall](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files) that the built-in function `open` takes in a file path and returns *a file object* (sometimes called a *file handle*). Below are a few properties of file objects:

* `open(path)` opens the file at location `path` for reading.
* `open(path)` is an *iterable*, which contains successive lines of the file.
* Once you are done with a file object, it should be closed so that the underlying system resources are released. To ensure a file is closed once done, you should use a *context manager* as follows:
```
with open(path) as fh:
    for line in fh:
        process_line(line)
```
* To read the entire file into a string, use the read method:
```
with open(path) as fh:
    s = fh.read()
```
However, you should be careful when reading an entire file into memory that the file isn't too big! *You should avoid this whenever possible!*
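
The difference between the two reading styles can be sketched with a small throwaway file (the contents are made up): iterating keeps only one line in memory at a time, while `read` loads everything at once.

```python
import os
import tempfile

# Write a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    path = tmp.name

# Lazy: only one line is held in memory at a time.
longest = 0
with open(path) as fh:
    for line in fh:
        longest = max(longest, len(line.rstrip("\n")))

# Eager: the whole file becomes one string.
with open(path) as fh:
    text = fh.read()

os.remove(path)
print(longest)    # 11
print(len(text))  # 34
```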

**Question 5 (Reading Files):**

Create a function `last_chars` that takes a file object and returns a string consisting of the last character of each line.

*Remark:* A newline is the "delimiter" of the lines of a file, and doesn't count as part of the line (as the tests imply). Every other character is part of the line. For more info on this, see [the interpretation](https://en.wikipedia.org/wiki/Newline#Interpretation) of files as a 'newline delimited variables' file.
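
To see what the remark means in practice, the sketch below iterates over an in-memory stand-in for a file (`io.StringIO`, with made-up contents). Each line arrives with its trailing newline still attached, and an empty line has no last character at all.

```python
import io

# io.StringIO behaves like an opened text file (made-up contents).
fh = io.StringIO("hello\nworld!\n\nlast")

stripped = []
for line in fh:
    content = line.rstrip("\n")  # remove only the delimiter
    print(repr(line), "->", repr(content))
    stripped.append(content)

print(stripped)  # ['hello', 'world!', '', 'last']
```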



In [25]:
def last_chars(fh):
    """
    last_chars takes a file object and returns a 
    string consisting of the last character of each line.

    :param fh: a file object to read from.
    :returns: a string of last characters from fh

    :Example:
    >>> fp = os.path.join('data', 'chars.txt')
    >>> last_chars(open(fp))
    'hrg'
    """
    s = ""
    for line in fh:
        # Drop only the trailing newline; it delimits lines and is not part of them.
        content = line.rstrip("\n")
        if content:
            s += content[-1]
    return s

In [26]:
fp = os.path.join('data', 'chars.txt')

In [27]:
last_chars(open(fp))

'hrg'

---

## `numpy` exercises

For an introduction to arrays and `numpy` recall the relevant section of [DSC 10](https://www.inferentialthinking.com/chapters/05/1/Arrays.html).

**Question 6 (Basic Arrays):**

Create the following functions using `numpy` methods satisfying the requirements given in each part. Your solutions should **not** contain any loops or list comprehensions.

* A function `arr_1` that takes in a `numpy` array and adds to each element the square-root of the index of each element.

* A function `arr_2` that takes in a `numpy` array of integers and returns a boolean array (i.e. an array of booleans) whose `ith` element is `True` if and only if the `ith` element of the input array is divisible by 16.

* A function `arr_3` that takes in a `numpy` array of [stock prices](https://en.wikipedia.org/wiki/Stock) per share on successive days in USD and returns an array of growth rates. That is, the `ith` number of the output array should contain the rate of growth in stock price from the $i^{th}$ day to the $(i+1)^{th}$ day. The growth rate should be a proportion, rounded to the nearest hundredth.

* Suppose:
    - `A` is a `numpy` array of [stock prices](https://en.wikipedia.org/wiki/Stock) per share for a company on successive days in USD 
    - you start with \\$20, and put aside \\$20 at the end of each day to buy as much stock as possible the following day. 
    - Any money left-over after a given day is saved for possibly buying stock on a future day. 
    - Create a function `arr_4` that takes in `A` and returns the day on which you can buy at least one share from 'left-over' money. If this never happens, return `-1`. The first stock purchase occurs on day 0. *Note: you cannot buy fractions of a share of stock*.
    
*Example:* If the stock price is \\$3 every day, then the answer is 'day 1':
* day 0: buy 6 shares; \\$2 left-over; \\$22 at end of day.
* day 1: buy 7 shares; \\$1 left-over; \\$21 at end of day.
This is more than the 6 shares that \\$20 can buy.
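
The two stock calculations can be checked on tiny inputs. The sketch below is only a reference for testing: `growth_rates` applies the growth formula directly, and `first_leftover_day` (a hypothetical name) simulates the purchase rule with a plain loop, which the question itself does not allow in a submitted solution.

```python
import numpy as np


def growth_rates(prices):
    # (next day - this day) / this day, as a proportion
    return np.round(prices[1:] / prices[:-1] - 1, 2)


def first_leftover_day(prices, allowance=20):
    """Loop-based reference for the 'left-over money' rule (not vectorized)."""
    leftover = 0
    for day, price in enumerate(prices):
        cash = allowance + leftover
        # Buying more shares than the daily allowance alone could buy
        # means at least one share came from left-over money.
        if cash // price > allowance // price:
            return day
        leftover = cash - (cash // price) * price
    return -1


print(growth_rates(np.array([100.0, 110.0, 99.0])).tolist())  # [0.1, -0.1]
print(first_leftover_day(np.array([3, 3, 3, 3])))             # 1
```

Comparing a vectorized `arr_4` against this loop on a handful of random price arrays is one way to gain confidence in the vectorized version.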

In [16]:
fp = os.path.join('data', 'stocks.csv')
stocks = np.array([float(x) for x in open(fp)])

In [174]:
def arr_3(A):
    """
    arr_3 takes in a numpy array of stock
    prices per share on successive days in
    USD and returns an array of growth rates.

    :param A: a 1d numpy array.
    :returns: a 1d numpy array.

    :Example:
    >>> fp = os.path.join('data', 'stocks.csv')
    >>> stocks = np.array([float(x) for x in open(fp)])
    >>> out = arr_3(stocks)
    >>> isinstance(out, np.ndarray)
    True
    >>> out.dtype == np.dtype('float')
    True
    >>> out.max() == 0.03
    True
    """
    
    return np.round(A[1:] / A[:-1] - 1, 2)



In [175]:
arr_3(stocks)

array([-0.  ,  0.01, -0.01,  0.  ,  0.  , -0.  ,  0.02,  0.01,  0.02,
        0.01,  0.01,  0.  ,  0.  ,  0.01, -0.  , -0.01, -0.  ,  0.  ,
       -0.01,  0.01,  0.  , -0.  ,  0.02,  0.  ,  0.01, -0.  ,  0.  ,
        0.01, -0.02,  0.01,  0.  , -0.01,  0.01,  0.  , -0.  , -0.01,
        0.01,  0.03, -0.01, -0.  ,  0.01,  0.01,  0.  ,  0.  , -0.  ,
        0.01,  0.01, -0.  ,  0.  ,  0.02, -0.01, -0.01,  0.01, -0.01,
        0.01,  0.02, -0.01, -0.01, -0.  ,  0.01,  0.  , -0.  ,  0.  ,
        0.01, -0.  ,  0.01, -0.  ,  0.01,  0.01,  0.01, -0.  , -0.  ,
        0.01,  0.  ,  0.  ,  0.02, -0.02, -0.  ,  0.01,  0.  ,  0.01,
        0.01, -0.  , -0.02, -0.01, -0.01, -0.01, -0.01,  0.  ,  0.01,
        0.01,  0.01, -0.  ,  0.  , -0.  , -0.01,  0.01,  0.01,  0.  ])

In [65]:
def arr_1(A):
    """
    arr_1 takes in a numpy array and
    adds to each element the square-root of
    the index of each element.

    :param A: a 1d numpy array.
    :returns: a 1d numpy array.

    :Example:
    >>> A = np.array([2, 4, 6, 7])
    >>> out = arr_1(A)
    >>> isinstance(out, np.ndarray)
    True
    >>> np.all(out >= A)
    True
    """

    return A + np.sqrt(np.arange(len(A)))


def arr_2(A):
    """
    arr_2 takes in a numpy array of integers
    and returns a boolean array (i.e. an array of booleans)
    whose ith element is True if and only if the ith element
    of the input array is divisible by 16.

    :param A: a 1d numpy array.
    :returns: a 1d numpy boolean array.

    :Example:
    >>> out = arr_2(np.array([1, 2, 16, 17, 32, 33]))
    >>> isinstance(out, np.ndarray)
    True
    >>> out.dtype == np.dtype('bool')
    True
    """

    return A % 16 == 0


def arr_3(A):
    """
    arr_3 takes in a numpy array of stock
    prices per share on successive days in
    USD and returns an array of growth rates.

    :param A: a 1d numpy array.
    :returns: a 1d numpy array.

    :Example:
    >>> fp = os.path.join('data', 'stocks.csv')
    >>> stocks = np.array([float(x) for x in open(fp)])
    >>> out = arr_3(stocks)
    >>> isinstance(out, np.ndarray)
    True
    >>> out.dtype == np.dtype('float')
    True
    >>> out.max() == 0.03
    True
    """
    
    # Growth from day i to day i+1: (A[i+1] - A[i]) / A[i]
    return np.round(A[1:] / A[:-1] - 1, 2)
    

def arr_4(A):
    """
    arr_4 takes in A and returns the day on which
    you can buy at least one share from 'left-over'
    money. If this never happens, return -1. The
    first stock purchase occurs on day 0.

    :param A: a 1d numpy array of stock prices.
    :returns: an integer, the first such day (or -1).

    :Example:
    >>> import numbers
    >>> stocks = np.array([3, 3, 3, 3])
    >>> out = arr_4(stocks)
    >>> isinstance(out, numbers.Integral)
    True
    >>> out == 1
    True
    """
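    # Running total of the daily remainders (20 % A) through each day. Until
    # left-over money is first spent, nothing is deducted from this total.
    leftOver = np.cumsum(20 % A)
    # The first day on which the total covers a full share is the answer.
    ans = np.where(A <= leftOver)
    if len(ans[0]):
        return ans[0][0]
    return -1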

In [80]:
A = np.array([3, 5, 6, 4])
leftOver = np.cumsum(20 % A)
ans = np.where(A <= leftOver)
if len(ans[0]):
    print(ans[0][0])
else:
    print(-1)

[2 2 4 4]
[3 5 6 4]
3


In [73]:
A

array([3, 3, 5, 5])

In [74]:
leftOver

array([2, 4, 4, 4])

In [64]:
A = np.array([3, 5, 5, 5])
leftOver = np.cumsum(20 % A)
print(leftOver)
ans = np.where(A[1:] <= leftOver[:-1])
if len(ans[0]):
    print(ans[0][0])
else:
    print(-1)

[2 2 2 2]
-1


---
## Getting Started with Pandas

The following questions will help you get comfortable with Pandas. These questions are similar to questions on tables in DSC 10; review the [textbook](https://www.inferentialthinking.com) as necessary. As always for Pandas questions:
1. Avoid writing loops through the rows of the dataset to do the problem, and
2. Test the output/correctness of your code with the help of the dataset given, but be sure your code will also run on data "like" the dataset given (sampling rows using the `.sample` method is useful for this!).
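
A minimal sketch of the second tip: run the function on several random subsets drawn with `.sample` and check a property that must hold for any subset. The table and the helper `total_movies` are made up for illustration.

```python
import pandas as pd

# Made-up table in the shape of a yearly summary (values are illustrative).
df = pd.DataFrame({
    "Year": [2001, 2002, 2003, 2004, 2005],
    "Number of Movies": [10, 12, 9, 15, 11],
})


def total_movies(frame):
    # Vectorized column sum, no loop over rows.
    return frame["Number of Movies"].sum()


# Re-run the function on random subsets; it should never fail,
# and a subset total can never exceed the full total here.
for seed in range(3):
    subset = df.sample(n=3, random_state=seed)
    assert total_movies(subset) <= total_movies(df)

print(total_movies(df))  # 57
```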

**Question 7 (Pandas basics):**

Read in the file `movies_by_year.csv` in the `data` directory and understand the dataset by answering the following questions. To do this, create a function `movie_stats` that takes in a dataframe like `movies` and returns a series containing the following statistics:
* The number of years covered by the dataset (`num_years`).
* The total number of movies made over all years in the dataset (`tot_movies`).
* The year with the fewest number of movies made; a tie should return the earliest year (`yr_fewest_movies`).
* The average amount of money grossed over all the years in the dataset (`avg_gross`).
* The year with the highest gross *per movie* (`highest_per_movie`).
* The name of the top movie during the second-lowest (total) grossing year (`second_lowest`).
* The average number of movies made the year *after* a Harry Potter movie was the #1 movie (`avg_after_harry`).

The index labels of the output series are given in parentheses above.

*Note*: Your function should work on a dataset of the same format that contains information from other years. You may assume that none of the answers involving ranking returns a tie.

*Note*: To make sure your function still runs in the event that one of the 7 parts throws an exception (e.g. due to a very incorrect answer), wrap each part in its own `try`/`except` block.
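
One way to keep the per-part error handling compact is a small wrapper, sketched below. The helper `safe` and the two-row table are hypothetical and only illustrate the pattern; a failing part yields a default value instead of stopping the whole function.

```python
import pandas as pd


def safe(compute, default=None):
    """Run compute(); return default instead of raising (hypothetical helper)."""
    try:
        return compute()
    except Exception:
        return default


# Made-up table that is missing the 'Total Gross' column on purpose.
df = pd.DataFrame({"Year": [2001, 2002], "Number of Movies": [10, 12]})

stats = pd.Series({
    "num_years": safe(lambda: df["Year"].nunique()),
    "tot_movies": safe(lambda: df["Number of Movies"].sum()),
    # This column does not exist, so the KeyError is caught.
    "avg_gross": safe(lambda: df["Total Gross"].mean()),
})
print(stats)
```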

In [17]:
movie_fp = os.path.join('data', 'movies_by_year.csv')
movies = pd.read_csv(movie_fp)

In [89]:
movies.head()

| |Year|Total Gross|Number of Movies|#1 Movie|
|---|---|---|---|---|
|0|2015|11128.5|702|Star Wars: The Force Awakens|
|1|2014|10360.8|702|American Sniper|
|2|2013|10923.6|688|Catching Fire|
|3|2012|10837.4|667|The Avengers|
|4|2011|10174.3|602|Harry Potter / Deathly Hallows (P2)|


In [131]:
np.mean(movies[movies["Year"] > firstHarryPotterYear]["Number of Movies"])

NameError: name 'firstHarryPotterYear' is not defined

In [133]:
movies = movies.sort_values("Year")
d = {}
d["num_years"] = len(movies.groupby("Year").count())
d["tot_movies"] = sum(movies["Number of Movies"])
d["yr_fewest_movies"] = movies[movies["Number of Movies"] == min(movies["Number of Movies"])].sort_values("Year").iloc[0]["Year"]
d["avg_gross"] = sum(movies["Total Gross"]) / d["num_years"]
d["highest_per_movie"] = movies.loc[(movies["Total Gross"] / movies["Number of Movies"]).idxmin()]["Year"]
d["second_lowest"] = movies.sort_values("Total Gross").iloc[1]["#1 Movie"]
firstHarryPotterYear = movies[movies["#1 Movie"].str.contains("Harry Potter")].sort_values("Year").iloc[0]["Year"]
d["avg_after_harry"] = np.mean(movies[movies["Year"] > firstHarryPotterYear]["Number of Movies"])
pd.Series(d)

num_years                            34
tot_movies                        17834
yr_fewest_movies                   1990
avg_gross                       7226.91
highest_per_movie                  1984
second_lowest        Back to the Future
avg_after_harry                 596.286
dtype: object

In [96]:
movies[movies["Number of Movies"] == min(movies["Number of Movies"])].sort_values("Year").iloc[0]["Year"]

Year                      1990
Total Gross             5021.8
Number of Movies           410
#1 Movie            Home Alone
Name: 25, dtype: object

In [None]:
# ---------------------------------------------------------------------
# Question # 7
# ---------------------------------------------------------------------

def movie_stats(movies):
    """
    movie_stats returns a series as specified in the notebook.

    :param movies: a dataframe of summaries of
    movies per year as found in `movies_by_year.csv`
    :return: a series with index specified in the notebook.

    :Example:
    >>> movie_fp = os.path.join('data', 'movies_by_year.csv')
    >>> movies = pd.read_csv(movie_fp)
    >>> out = movie_stats(movies)
    >>> isinstance(out, pd.Series)
    True
    >>> 'num_years' in out.index
    True
    >>> isinstance(out.loc['second_lowest'], str)
    True
    """
    movies = movies.sort_values("Year")
    d = {}
    try:
        d["num_years"] = len(movies.groupby("Year").count())
    except Exception:
        d["num_years"] = -1
    try:
        d["tot_movies"] = sum(movies["Number of Movies"])
    except Exception:
        d["tot_movies"] = -1
    try:
        d["yr_fewest_movies"] = movies[movies["Number of Movies"] == min(movies["Number of Movies"])].sort_values("Year").iloc[0]["Year"]
    except Exception:
        d["yr_fewest_movies"] = -1
    try:
        d["avg_gross"] = sum(movies["Total Gross"]) / d["num_years"]
    except Exception:
        d["avg_gross"] = -1
    try:
        # Highest gross per movie, so take the maximum of the ratio.
        d["highest_per_movie"] = movies.loc[(movies["Total Gross"] / movies["Number of Movies"]).idxmax()]["Year"]
    except Exception:
        d["highest_per_movie"] = -1
    try:
        d["second_lowest"] = movies.sort_values("Total Gross").iloc[1]["#1 Movie"]
    except Exception:
        d["second_lowest"] = -1
    try:
        # Years in which a Harry Potter film was the #1 movie.
        harryYears = movies.loc[movies["#1 Movie"].str.contains("Harry Potter"), "Year"]
        # Average only over the years immediately following those years.
        d["avg_after_harry"] = movies.loc[movies["Year"].isin(harryYears + 1), "Number of Movies"].mean()
    except Exception:
        d["avg_after_harry"] = -1
    return pd.Series(d)

---

## CSV Files

**Question 8 (Reading malformed csv files):**

`malformed.csv` contains a file of comma-separated values, containing the following fields:


|column name|description|type|
|---|---|---|
|first|first name of person|str|
|last|last name of person|str|
|weight|weight of person (lbs)|float|
|height|height of person (in)|float|
|geo|location of person; comma-separated latitude/longitude|str|

Unfortunately, the entries contain errors that cause the Pandas `read_csv` function to fail parsing the file with the default settings. Instead, you must read in the file manually using Python's built-in `open` function.

Clean the csv file into a Pandas DataFrame with columns as described in the table above, by creating a function called `parse_malformed` that takes in a file path and returns a parsed, properly-typed dataframe. The dataframe should contain columns as described in the table above (with the specified types); it should agree with `pd.read_csv` when the lines are not malformed.


*Note:* Assume that the given csv file is a sample of a larger file; you will be graded against a **different** sample of the larger file that has the same type of parsing errors. That is, you should **not** hard-code your cleaning of the data to specific errors on specific lines in the data.
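
A rough sketch of line-by-line cleaning is shown below. It assumes the errors are stray double quotes around fields and extra commas that produce empty fields; this is an assumption, the sample lines are made up, and the real file may contain other kinds of errors.

```python
import io

import pandas as pd

# Made-up lines imitating the assumed error types; the real file may differ.
raw = io.StringIO(
    'first,last,weight,height,geo\n'
    'Ann,Lee,150.0,64.0,"36.9,-8.3"\n'     # well-formed
    'Bob,,Ray,"180.0",70.0,"38.0,-7.3"\n'  # extra comma, quoted number
    'Cy,Poe,135.0,,58.0,"37.3,11.0",\n'    # extra commas
)

columns = raw.readline().strip().split(",")
rows = []
for line in raw:
    # Drop quotes, split on commas, and discard empty fields.
    fields = [f for f in line.strip().replace('"', "").split(",") if f]
    first, last, weight, height, lat, lon = fields
    # Convert the numeric fields and re-join latitude/longitude into 'geo'.
    rows.append([first, last, float(weight), float(height), f"{lat},{lon}"])

df = pd.DataFrame(rows, columns=columns)
print(df.dtypes)
```

Because the cleaning rules are applied to every line rather than to specific line numbers, the same code runs unchanged on a different sample with the same kinds of errors.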

In [134]:
fp = os.path.join('data', 'malformed.csv')

In [168]:
with open(fp, "r") as csv:
    l = []
    columns = csv.readline().strip().split(",")
    for line in csv:
        row = line.strip().split(",")
        row = list(filter(None, row))
        row[4] = f"{row[4]}, {row[5]}".replace('"', '')
        row = row[:-1]
        row[3] = float(row[3].replace('"', ''))
        row[2] = float(row[2].replace('"', ''))
        l.append(row)
df = pd.DataFrame(l, columns=columns)

In [169]:
df.iloc[9:13]

| |first|last|weight|height|geo|
|---|---|---|---|---|---|
|9|Yvette|Trayce|179.0|67.0|36.9, -8.3|
|10|Cody|Hatim|150.0|63.0|38.0, -7.3|
|11|Marissa|Daud|135.0|58.0|37.3, 11.0|
|12|Logan|Cristel|133.0|67.0|35.5, -110.2|


In [171]:
pd.read_csv(fp, nrows=4, skiprows=10, names=columns)

| |first|last|weight|height|geo|
|---|---|---|---|---|---|
|0|Yvette|Trayce|179.0|67.0|36.9,-8.3|
|1|Cody|Hatim|150.0|63.0|38.0,-7.3|
|2|Marissa|Daud|135.0|58.0|37.3,11.0|
|3|Logan|Cristel|133.0|67.0|35.5,-110.2|


In [None]:


# ---------------------------------------------------------------------
# Question # 8
# ---------------------------------------------------------------------

def parse_malformed(fp):
    """
    Parses and loads the malformed csv data into a 
    properly formatted dataframe (as described in 
    the question).

    :param fp: file path of the malformed csv-file.
    :returns: a Pandas DataFrame of the data, 
    as specified in the question statement.

    :Example:
    >>> fp = os.path.join('data', 'malformed.csv')
    >>> df = parse_malformed(fp)
    >>> cols = ['first', 'last', 'weight', 'height', 'geo']
    >>> list(df.columns) == cols
    True
    >>> df['last'].dtype == np.dtype('O')
    True
    >>> df['height'].dtype == np.dtype('float64')
    True
    >>> df['geo'].str.contains(',').all()
    True
    >>> len(df) == 100
    True
    >>> dg = pd.read_csv(fp, nrows=4, skiprows=10, names=cols)
    >>> dg.index = range(9, 13)
    >>> (dg == df.iloc[9:13]).all().all()
    True
    """

    with open(fp, "r") as csv:
        l = []
        columns = csv.readline().strip().split(",")
        for line in csv:
            # Split on commas and drop empty fields left by stray commas.
            row = line.strip().split(",")
            row = list(filter(None, row))
            # Re-join latitude and longitude into a single 'geo' field.
            row[4] = f"{row[4]},{row[5]}".replace('"', '')
            row = row[:-1]
            row[3] = float(row[3].replace('"', ''))
            row[2] = float(row[2].replace('"', ''))
            l.append(row)
    return pd.DataFrame(l, columns=columns)

## Congratulations! You're done!

* Submit the lab on Gradescope