# Lab 1 - Python Assignment 

Upon successful completion of this assignment, a student will be able to:

* Correctly set up a Python environment on the campus Linux machines
* Add new text and code cells to a Colab notebook
* Gain experience in formatting text using Markdown
* Load a data set, access it, and explore its properties
* Submit the assignment to Gradescope

We start with the standard setup for our notebook files by importing the standard modules.

In [1]:
#  Import standard modules  
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

import otter
grader = otter.Notebook()

In [2]:
# Import modules for this lab 
import re 
import os
import platform 
import sys 

import importlib
from packaging.version import Version, parse 

<!-- BEGIN QUESTION -->

## Q1 - Setup

The following code checks whether your notebook is running on Gradescope (GS), Colab (COLAB), or the Linux lab machine (LLM) Python environment you were asked to set up.

In [3]:
# flag if notebook is running on Gradescope 
if re.search(r'amzn', platform.uname().release): 
    GS = True
else: 
    GS = False

# flag if notebook is running on Colaboratory 
try:
    import google.colab
    COLAB = True
except ImportError:  # catch only a failed import, not every possible error
    COLAB = False

# flag if running on Linux lab machines. 
cname = platform.uname().node
if re.search(r'(guardian|colossus|c28|coc-15954-m)', cname):
    LLM = True 
else: 
    LLM = False

print("System: GS - %s, COLAB - %s, LLM - %s" % (GS, COLAB, LLM))

System: GS - False, COLAB - False, LLM - True
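
The hostname test above can be factored into a small helper that is easy to check without switching machines. This is only a sketch: `classify_host` and `LAB_HOST_PATTERN` are hypothetical names, and the pattern is simply the one used in the cell above.

```python
import platform
import re

# Pattern copied from the cell above; the names here are illustrative only.
LAB_HOST_PATTERN = re.compile(r'(guardian|colossus|c28|coc-15954-m)')

def classify_host(node):
    """Return True if the hostname looks like one of the Linux lab machines."""
    return bool(LAB_HOST_PATTERN.search(node))

# platform.uname() returns a named tuple; .node is the machine's network name.
current_node = platform.uname().node
print("Running on lab machine:", classify_host(current_node))
```

Wrapping the `re.search` result in `bool()` gives a plain `True`/`False` flag directly, without an `if`/`else` block.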


### Check Setup 

Check to make sure the correct version of Python is running.

In [4]:
pver = sys.version 
print(pver) 

3.10.12 (main, Jul  5 2023, 18:54:27) [GCC 11.2.0]
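
`sys.version` is a free-form string meant for display. For a programmatic check, `sys.version_info` can be compared like a tuple. A minimal sketch, assuming Python 3.10 is the expected version (this is taken from the sample output above; the lab does not state a required version):

```python
import sys

# Assumption: 3.10 comes from the sample output above, not a stated requirement.
REQUIRED = (3, 10)

# sys.version_info behaves like a tuple, so a slice compares element by element.
running = tuple(sys.version_info[:2])
matches = running == REQUIRED
print("Python %d.%d, matches required: %s" % (running[0], running[1], matches))
```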


It is good practice to list all imports at the top of the notebook. You can import modules in 
later cells as needed, but listing them at the top clearly shows everything that needs to be available and installed.

If you are developing on Colab, the otter-grader package is not available by default, so you will need to install it
with pip: `!pip install otter-grader==5.1`.

The python environment that is running is: 

In [5]:
env1 = sys.executable
print(env1)

/home/campus19/trkosire/.conda/envs/un5550/bin/python


In [6]:
env2 =!conda info | grep 'active env'
print(env2)

['     active environment : un5550', '    active env location : /home/campus19/trkosire/.conda/envs/un5550']


Make sure that the environment you set up for the class is what is being used to execute your notebook. For example, the default name should be "un5550". 

Next, we are going to look at all the packages installed. 

In [7]:
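OK = '\x1b[42m[ OK ]\x1b[0m'
FAIL = '\x1b[41m[FAIL]\x1b[0m'   # leading backslash was missing from the escape code

def import_version(pkg, req_ver, fail_msg=""):
    """Import pkg and report whether its version matches req_ver."""
    mod, ver = None, None
    try:
        mod = importlib.import_module(pkg)
        ver = Version(mod.__version__)
        if ver != req_ver:
            # use the pkg argument, not the global loop variable
            print(FAIL, "%s version %s required, but %s installed."
                  % (pkg, req_ver, ver))
        else:
            print(OK, '%s version %s' % (pkg, ver))
    except ImportError:
        print(FAIL, '%s not installed. %s' % (pkg, fail_msg))
    # ver stays None when the import fails, so the return never raises NameError
    return (mod, ver, req_ver)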

requirements = {'numpy': parse("1.25.2"), 'scipy': parse("1.11.1"),
                'matplotlib': parse("3.7.1"), 'pandas': parse("2.0.3"),
                'IPython': parse("8.14.0"), 'seaborn': parse('0.12.2'),
                'plotly': parse("5.9.0"), 'dill': parse('0.3.7'),
                'sklearn': parse("1.3.0")
                }

pks = []
for lib, required_version in list(requirements.items()):
    pks.append(import_version(lib, required_version))

[ OK ] numpy version 1.25.2
[ OK ] scipy version 1.11.1
[ OK ] matplotlib version 3.7.1
[ OK ] pandas version 2.0.3
[ OK ] IPython version 8.14.0
[ OK ] seaborn version 0.12.2
[ OK ] plotly version 5.9.0
[ OK ] dill version 0.3.7
[ OK ] sklearn version 1.3.0
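
Not every package defines `__version__`. As an alternative sketch, the standard library's `importlib.metadata` can look up an installed version by distribution name. The distribution name can differ from the import name (for example, `sklearn` is installed as `scikit-learn`). The helper name `installed_version` and the deliberately missing package name are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(dist_name):
    """Return the installed version string for a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None

# Distribution names, not import names: 'scikit-learn' rather than 'sklearn'.
for name in ['numpy', 'scikit-learn', 'no-such-package-un5550-demo']:
    print(name, '->', installed_version(name))
```

Returning `None` for a missing package keeps the caller free to decide how to report the failure.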


<!-- END QUESTION -->

## Example 1 - More Data Cleaning 
*Adapted from J. Sullivan*

Let's look at another data file to see additional data cleaning steps and code.  

The initial data set reads in part: 

![property data](https://pages.mtu.edu/~lebrown/un5550-f20/week1/property-data.jpg)

In [8]:
prop = pd.read_csv("data/property.csv")
prop

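|   | PID | ST_NUM | ST_NAME | OWN_OCCUPIED | NUM_BEDROOMS | NUM_BATH | SQ_FT |
|---|---|---|---|---|---|---|---|
| 0 | 100001000.0 | 104.0 | PUTNAM | Y | 3 | 1 | 1000 |
| 1 | 100002000.0 | 197.0 | LEXINGTON | N | 3 | 1.5 | -- |
| 2 | 100003000.0 | NaN | LEXINGTON | N | NaN | 1 | 850 |
| 3 | 100004000.0 | 201.0 | BERKELEY | 12 | 1 | NaN | 700 |
| 4 | NaN | 203.0 | BERKELEY | Y | 3 | 2 | 1600 |
| 5 | 100006000.0 | 207.0 | BERKELEY | Y | NaN | 1 | 800 |
| 6 | 100007000.0 | NaN | WASHINGTON | NaN | 2 | HURLEY | 950 |
| 7 | 100008000.0 | 213.0 | TREMONT | Y | -- | 1 | NaN |
| 8 | 100009000.0 | 215.0 | TREMONT | Y | na | 2 | 1800 |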


We can see that `pandas` is already able to find some of the different ways that we have missing values in the data.

For instance, in the ST_NUM column, the 3rd entry is blank and the 7th entry is marked as missing in the file. `pandas` reads both as `NaN`, so both are found by the `isnull()` method.

In [9]:
prop['ST_NUM'].isnull()

0    False
1    False
2     True
3    False
4    False
5    False
6     True
7    False
8    False
Name: ST_NUM, dtype: bool

However, there are other missing value encodings that pandas does not immediately recognize. 

Let's look at the `NUM_BEDROOMS` column. 

![property data 2](https://pages.mtu.edu/~lebrown/un5550-f20/week1/property-data2.jpg)




In this column, we have missing values as "n/a", "NA", "--" and "na".

Let's see what `pandas` automatically recognizes.

In [10]:
prop['NUM_BEDROOMS'].isnull()

0    False
1    False
2     True
3    False
4    False
5     True
6    False
7    False
8    False
Name: NUM_BEDROOMS, dtype: bool

`pandas` automatically recognizes the "n/a" and "NA" but not the "--" and "na". 

Let's change that! 

In [12]:
# Making a list of missing value types
missing_values = ["n/a", "na", "--", "NA"]
prop2 = pd.read_csv("data/property.csv", na_values = missing_values)

In [13]:
print(prop2['NUM_BEDROOMS'])
print(prop2['NUM_BEDROOMS'].isnull())

0    3.0
1    3.0
2    NaN
3    1.0
4    3.0
5    NaN
6    2.0
7    NaN
8    NaN
Name: NUM_BEDROOMS, dtype: float64
0    False
1    False
2     True
3    False
4    False
5     True
6    False
7     True
8     True
Name: NUM_BEDROOMS, dtype: bool
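
The effect of `na_values` can be reproduced on a tiny in-memory CSV, which makes it easy to see which encodings are recognized by default. This is a sketch with made-up data (the `id` and `beds` columns are hypothetical), not the property file itself:

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a data file (made-up values).
raw = io.StringIO("id,beds\n1,3\n2,n/a\n3,--\n4,na\n5,NA\n")

default = pd.read_csv(raw)
raw.seek(0)  # rewind so the same text can be parsed again
custom = pd.read_csv(raw, na_values=["--", "na"])

print(default['beds'].isnull().tolist())  # [False, True, False, False, True]
print(custom['beds'].isnull().tolist())   # [False, True, True, True, True]
```

The entries passed in `na_values` are added to the default markers rather than replacing them, which is why "n/a" and "NA" are still detected in the second read.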


## Example 2 - Printing


In many courses and tutorials for new languages, the first thing you learn is printing "Hello World".

In [14]:
print('Hello World')

Hello World


We can also capture `input` from the user. 
https://docs.python.org/3/library/functions.html#input

In [16]:
firstName = input('What is your name?  ')

What is your name?   Tagore


In [17]:
"Hello " + firstName + "!"

'Hello Tagore!'

Apply the built-in function `dir()` to the variable `firstName` above and print the outcome.

https://docs.python.org/3/library/functions.html#dir

In [18]:
dir(firstName)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']


This lists all the attributes and methods available on the string `firstName`.

## Q2 - Strings

I want you to explore using the built-in function `len()` and the string methods `split()` and `strip()` on the following string. 

https://docs.python.org/3/library/functions.html

In [19]:
className = " Introduction   to   Data Science   "

In [22]:
# Show how to find the length of the string "className" 
# Store the results in a new variable "class_length"
class_length = len(className)
class_length

36

In [23]:
# Show the results of the `split()` function on the string "className"  
# Store the results in a new variable "class_split"
class_split = className.split()
class_split

['Introduction', 'to', 'Data', 'Science']

In [24]:
# Save the results of the `strip()` function on the string "className" in a 
# new variable "className2"
className2 = className.strip()
className2

'Introduction   to   Data Science'
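
The difference between these methods is easiest to see side by side. A short sketch on a separate, made-up string (the variable `title` is not part of the assignment):

```python
title = "  Data   Science  "

# split() with no argument splits on runs of whitespace and drops empty pieces.
print(title.split())        # ['Data', 'Science']

# split(' ') splits at every single space, so empty strings appear in the result.
print(title.split(' '))     # ['', '', 'Data', '', '', 'Science', '', '']

# strip() removes leading and trailing whitespace only; inner spacing is kept.
print(repr(title.strip()))  # 'Data   Science'

# lstrip() and rstrip() trim just one side.
print(repr(title.lstrip()), repr(title.rstrip()))
```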

In [25]:
grader.check("q2")

## Example 3 - Comments 

To create a comment line (in line with the code), use the # (hash) symbol followed by a space. (Shortcut: Ctrl+/) [To uncomment, remove the # or press Ctrl+/ again.]

Another option is to enclose the text in triple quotes (""" or '''). Strictly speaking, this creates a string literal rather than a comment, and it needs to be on its own line, separate from the code. Different programming languages have different approaches to commenting, so please be aware of this.

In [26]:
# This is a comment

In [27]:
'''This is a larger comment block 
that may span multiple lines 
'''
2 + 2

4
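
Because a triple-quoted block is a string literal, Python treats it differently from a `#` comment. A small sketch of the difference (the function names `add` and `add_no_doc` are made up for illustration): when the string is the first statement in a function, it is stored as the function's docstring.

```python
def add(a, b):
    """Return the sum of a and b."""
    # A real comment: ignored by the interpreter entirely.
    return a + b

def add_no_doc(a, b):
    # Only a comment here, so there is no docstring.
    return a + b

print(add.__doc__)         # Return the sum of a and b.
print(add_no_doc.__doc__)  # None
```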

<!-- BEGIN QUESTION -->

## Q3 - Markdown

The Markdown option for cells in a Jupyter notebook provides a way to display information to the user alongside particular code snippets. For more information and reading, please look into:

https://help.github.com/articles/basic-writing-and-formatting-syntax/

https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html

Colab's Markdown Guide: https://colab.research.google.com/notebooks/markdown_guide.ipynb#scrollTo=5Y3CStVkLxqt

For this exercise add a new 'Text' cell and try to recreate the following block of text. 

![example markdown](https://pages.mtu.edu/~lebrown/un5550-f20/week1/markdown-example.png)




We can start with a few different paragraphs of text. This first paragraph will have a few sentences with various markups
found. Things like **bold** *italics* ~~strikethrough~~, and even `monospace`.

Here is another paragraph of text that contains a url https://mtu.edu.

We can have lists:

* one
* two
* three

And more lists:

1. one
1. two
1. three

Nested lists:

* one
  * one A
  * one B
* two
* three




*Enter your Markdown here*

<!-- END QUESTION -->

## Example 4 - String Operations 

Here you can see some more operations working with strings.

https://docs.python.org/3/library/stdtypes.html#str

In [28]:
# named `text` so the built-in `str` type is not shadowed
text = "Hello Data Science 2023"

In [29]:
print(text.find("2023"))

19


In [30]:
print(text[-4:])

2023


In [31]:
text.upper()

'HELLO DATA SCIENCE 2023'

In [32]:
text.lower()

'hello data science 2023'

In [33]:
text + ' & ' + 'FutureDataScientist'

'Hello Data Science 2023 & FutureDataScientist'
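
Two details of these operations are worth seeing explicitly: what `find()` returns when there is no match, and how negative slice bounds behave. A brief sketch (the variable `message` is chosen here for illustration):

```python
message = "Hello Data Science 2023"

# find() returns the index of the first match, or -1 when there is no match.
print(message.find("Data"))    # 6
print(message.find("Python"))  # -1

# index() is similar but raises ValueError instead of returning -1.
try:
    message.index("Python")
except ValueError:
    print("'Python' not found")

# Negative slice bounds count from the end of the string.
print(message[-4:])   # 2023
print(message[:-5])   # Hello Data Science
```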

## Q4 - Pandas 

Pandas Resources:
* https://pandas.pydata.org/
* https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

We are going to be using the Abalone data set.  This is part of the UCI Machine Learning Repository, a common place to find data sets for testing code and for learning about machine learning and data science. 

I have already downloaded the data from https://archive.ics.uci.edu/dataset/1/abalone
 
In the next cell, you will modify the code to read in the `abalone.data` file properly.  Use the following names for the columns:  
`sex`, `len`, `diameter`, `height`, `wh_wgt`, `shuck_wgt`, `vis_wgt`, `sh_wgt`, `rings`

*HINT:* You will need to look at using additional parameters for the [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function. It will be helpful to look at the documentation on `read_csv`   
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html



In [50]:
# abalone.data has no header row, so the column names are supplied explicitly
column_names = ['sex', 'len', 'diameter', 'height', 'wh_wgt',
                'shuck_wgt', 'vis_wgt', 'sh_wgt', 'rings']
df = pd.read_csv('data/abalone.data', names=column_names)
df.head()


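|   | sex | len | diameter | height | wh_wgt | shuck_wgt | vis_wgt | sh_wgt | rings |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 15 |
| 1 | M | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 7 |
| 2 | F | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 9 |
| 3 | M | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 | 10 |
| 4 | I | 0.33 | 0.255 | 0.08 | 0.205 | 0.0895 | 0.0395 | 0.055 | 7 |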
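
The reason `names=` matters for a headerless file can be shown on a tiny in-memory example. This is a sketch with two made-up rows and three hypothetical columns, not the abalone file itself:

```python
import io
import pandas as pd

# Two made-up rows in a headerless layout (values are illustrative only).
raw_text = "M,0.5,10\nF,0.4,8\n"

# Without names=, the first data row is consumed as the header.
wrong = pd.read_csv(io.StringIO(raw_text))

# With names=, pandas treats the file as headerless and keeps every row as data.
right = pd.read_csv(io.StringIO(raw_text), names=['sex', 'len', 'rings'])

print(wrong.shape)   # (1, 3)
print(right.shape)   # (2, 3)
print(right.columns.tolist())
```

A quick check of `shape` after reading is a cheap way to catch a row that was silently turned into column labels.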


In [51]:
grader.check("q4")

## Q5 - Pandas 

Here you will explore properties of the DataFrame and its attributes.

In [53]:
# Determine the number of rows and columns of the data set 
rows = df.shape[0]
columns = df.shape[1]
# Determine what are the column names 
dfColumnNames = df.columns.values.tolist()

print(f'No of rows: {rows}')
print(f'No of columns: {columns}')
dfColumnNames

No of rows: 4177
No of columns: 9


['sex',
 'len',
 'diameter',
 'height',
 'wh_wgt',
 'shuck_wgt',
 'vis_wgt',
 'sh_wgt',
 'rings']

In [54]:
grader.check("q5")

## Q6 - Pandas 

Show the first 4 rows of the DataFrame.

Show the last 7 rows of the DataFrame.

In [55]:
first_4_rows = df.head(4)
last_7_rows = df.tail(7) 
print(first_4_rows)
print(last_7_rows)

  sex    len  diameter  height  wh_wgt  shuck_wgt  vis_wgt  sh_wgt  rings
0   M  0.455     0.365   0.095  0.5140     0.2245   0.1010   0.150     15
1   M  0.350     0.265   0.090  0.2255     0.0995   0.0485   0.070      7
2   F  0.530     0.420   0.135  0.6770     0.2565   0.1415   0.210      9
3   M  0.440     0.365   0.125  0.5160     0.2155   0.1140   0.155     10
     sex    len  diameter  height  wh_wgt  shuck_wgt  vis_wgt  sh_wgt  rings
4170   M  0.550     0.430   0.130  0.8395     0.3155   0.1955  0.2405     10
4171   M  0.560     0.430   0.155  0.8675     0.4000   0.1720  0.2290      8
4172   F  0.565     0.450   0.165  0.8870     0.3700   0.2390  0.2490     11
4173   M  0.590     0.440   0.135  0.9660     0.4390   0.2145  0.2605     10
4174   M  0.600     0.475   0.205  1.1760     0.5255   0.2875  0.3080      9
4175   F  0.625     0.485   0.150  1.0945     0.5310   0.2610  0.2960     10
4176   M  0.710     0.555   0.195  1.9485     0.9455   0.3765  0.4950     12


In [56]:
grader.check("q6")

## Q7 - Pandas 

Practice selecting different parts of the DataFrame

Select the `sh_wgt` column


In [58]:
# select just the sh_wgt column 
shell_wgt = df['sh_wgt']
shell_wgt

0       0.1500
1       0.0700
2       0.2100
3       0.1550
4       0.0550
         ...  
4172    0.2490
4173    0.2605
4174    0.3080
4175    0.2960
4176    0.4950
Name: sh_wgt, Length: 4177, dtype: float64

In [61]:
diameter_and_height = df[['diameter','height']]
diameter_and_height

|   | diameter | height |
|---|---|---|
| 0 | 0.365 | 0.095 |
| 1 | 0.265 | 0.090 |
| 2 | 0.420 | 0.135 |
| 3 | 0.365 | 0.125 |
| 4 | 0.255 | 0.080 |
| ... | ... | ... |
| 4172 | 0.450 | 0.165 |
| 4173 | 0.440 | 0.135 |
| 4174 | 0.475 | 0.205 |
| 4175 | 0.485 | 0.150 |


In [62]:
grader.check("q7")

## Q8 - Pandas 

Select the following: 
* `row_5` - row with index=5, the 6th row, of the DataFrame 
* `row_6_8` - the 6th and 8th row of the DataFrame, and 
* `ansC` - every other row and every third column starting from the 2nd row and 3rd column




In [64]:
row_5 = df.iloc[5]
row_5

sex               I
len           0.425
diameter        0.3
height        0.095
wh_wgt       0.3515
shuck_wgt     0.141
vis_wgt      0.0775
sh_wgt         0.12
rings             8
Name: 5, dtype: object

In [65]:
row_6_8 = df.iloc[[5,7]]
row_6_8

|   | sex | len | diameter | height | wh_wgt | shuck_wgt | vis_wgt | sh_wgt | rings |
|---|---|---|---|---|---|---|---|---|---|
| 5 | I | 0.425 | 0.3 | 0.095 | 0.3515 | 0.141 | 0.0775 | 0.12 | 8 |
| 7 | F | 0.545 | 0.425 | 0.125 | 0.768 | 0.294 | 0.1495 | 0.26 | 16 |


In [71]:
ansC = df.iloc[1::,2::]
ansC

|   | diameter | height | wh_wgt | shuck_wgt | vis_wgt | sh_wgt | rings |
|---|---|---|---|---|---|---|---|
| 1 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.0700 | 7 |
| 2 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.2100 | 9 |
| 3 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.1550 | 10 |
| 4 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.0550 | 7 |
| 5 | 0.300 | 0.095 | 0.3515 | 0.1410 | 0.0775 | 0.1200 | 8 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4172 | 0.450 | 0.165 | 0.8870 | 0.3700 | 0.2390 | 0.2490 | 11 |
| 4173 | 0.440 | 0.135 | 0.9660 | 0.4390 | 0.2145 | 0.2605 | 10 |
| 4174 | 0.475 | 0.205 | 1.1760 | 0.5255 | 0.2875 | 0.3080 | 9 |
| 4175 | 0.485 | 0.150 | 1.0945 | 0.5310 | 0.2610 | 0.2960 | 10 |
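
The third selection in the prompt ("every other row and every third column") relies on the step part of a slice. A minimal sketch on a toy DataFrame, showing how `start:stop:step` works in `iloc` (the 6x6 frame and its column letters are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy 6x6 frame whose values encode position: value = 10 * row + column.
toy = pd.DataFrame(np.arange(6)[:, None] * 10 + np.arange(6),
                   columns=list('abcdef'))

# iloc slices follow start:stop:step.
# 1::2 starts at the 2nd row and takes every other row (positions 1, 3, 5);
# 2::3 starts at the 3rd column and takes every third column (positions 2, 5).
subset = toy.iloc[1::2, 2::3]
print(subset)
```

Omitting the step, as in `1::`, keeps every row from the start position onward, so the step value is what produces the skipping.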


In [72]:
grader.check("q8")

## Q9 - Data Selection and Statistics 

Compute the `mean()`, `max()`, and `min()` of the first 10 data points for all the weight columns.

*Hint: remember df.head(10) returns the first 10 rows of the DataFrame*

In [110]:
meanVals = df[['wh_wgt','shuck_wgt','vis_wgt','sh_wgt']].head(10).mean()
meanVals

wh_wgt       0.54385
shuck_wgt    0.20885
vis_wgt      0.10765
sh_wgt       0.18350
dtype: float64

In [111]:
maxVals = df[['wh_wgt','shuck_wgt','vis_wgt','sh_wgt']].head(10).max()
maxVals

wh_wgt       0.8945
shuck_wgt    0.3145
vis_wgt      0.1510
sh_wgt       0.3300
dtype: float64

In [112]:
minVals = df[['wh_wgt','shuck_wgt','vis_wgt','sh_wgt']].head(10).min()
minVals

wh_wgt       0.2050
shuck_wgt    0.0895
vis_wgt      0.0395
sh_wgt       0.0550
dtype: float64

In [113]:
grader.check("q9")

## Q10 - Data Selection and Statistics 

Group by column "sex" and find the median for the other variables. 

In [98]:
group = df.groupby('sex').median()
group

| sex | len | diameter | height | wh_wgt | shuck_wgt | vis_wgt | sh_wgt | rings |
|---|---|---|---|---|---|---|---|---|
| F | 0.59 | 0.465 | 0.16 | 1.0385 | 0.4405 | 0.224 | 0.295 | 10.0 |
| I | 0.435 | 0.335 | 0.11 | 0.384 | 0.16975 | 0.0805 | 0.113 | 8.0 |
| M | 0.58 | 0.455 | 0.155 | 0.97575 | 0.42175 | 0.21 | 0.276 | 10.0 |


In [99]:
grader.check("q10")

## Bonus - Data Selection and Statistics 

Find the mean weights of abalone with more than 12 rings. 

In [123]:
# keep only the rows with more than 12 rings, then average the weight columns
older = df.loc[df['rings'] > 12]
mean_vals = older[['wh_wgt','shuck_wgt','vis_wgt','sh_wgt']].mean()
mean_vals

wh_wgt       1.119511
shuck_wgt    0.432494
vis_wgt      0.238449
sh_wgt       0.350519
dtype: float64

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [124]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)

Running your submission against local test cases...



Your submission received the following results when run against available test cases:

    q2 results: All test cases passed!

    q4 results: All test cases passed!

    q5 results: All test cases passed!

    q6 results: All test cases passed!

    q7 results: All test cases passed!

    q8 results: All test cases passed!

    q9 results: All test cases passed!

    q10 results: All test cases passed!
