# Working with Data

**Fall 2024 - Instructor:  Chris Volinsky**

**Teaching Assistants: Aditya Deshpande, Stuti Mishra,**

<!-- previously called Dealing_with_Data -->

This notebook discusses how data is stored and managed in Python, as well as some basic exploratory functions.   These basic tools will be used over and over again this year, so it will be good for you to be VERY comfortable with them!  

**I strongly suggest that you work through the Chapter 2.3 Lab in the ISL book and Chapters 2.3-2.4 in the Shmueli book - they work through many of these same topics with excellent examples.**  If this material is too hard, you may struggle with completing assignments in this class.


## Python Packages and Built-in Functions

Python has a ton of packages that make doing complicated stuff very easy.

Packages contain pre-defined (built-in) functions that make our life easier!  We've seen pre-defined functions before, for example, the function `str()` that we used to convert numbers into strings in the Python Basics notebook.

In this class we will use five packages frequently:

- **`numpy`** (pronounced num-pie) is used for doing "math stuff", such as complex mathematical operations (e.g., square roots, exponents, logs), operations on matrices, and more.  
- **`pandas`** is a data manipulation package. It lets us store data in a data frame--which is the basic data structure used in data analytics. More on this soon.
- **`sklearn`** is a machine learning and data science package. It lets us do fairly complicated machine learning tasks, such as building regression or probability estimation models with only a few lines of code. (Nice!)
- **`matplotlib`** is a data visualization package.  It lets you make plots and graphs directly from your code. This can be a secret weapon when combined with notebooks, as you can very easily rerun analyses on different data or with slightly different code, and the graphs can just appear magically.  (Ok, always easier said than done, but you get the idea.)
- **`seaborn`** is an extension to matplotlib that helps make your plots look more appealing.



As we use these through the semester, their usefulness will become increasingly apparent.

To make the contents of a package available, you need to **import** it:

In [None]:
# load entire package
import numpy
import pandas
import sklearn
import matplotlib
import seaborn

Most of our notebooks will start with these imports.  In fact, it doesn't hurt to simply add these lines to the beginning of every notebook, whether or not you need them!

Sometimes it is easier to use short nicknames for packages. Many of these nicknames are standard in the Python community, so it is good to recognize them when you see them in code.  In fact, it is good practice to add this block of code to the beginning of every project, in case you need these packages!

In [None]:
# Load package and assign to shorter variable name
import pandas as pd
import sklearn as sk
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# this trick is required to get plots to display inline with the rest of your notebook,
# not in a separate window
%matplotlib inline

We can now use package-specific things. For example, numpy has many basic mathematical functions - including a function called `sqrt()` which will give us the square root of a number. Since it is part of numpy, we need to tell Python that that's where it is by using a dot (e.g., `np.sqrt()`).

In [None]:
num = 36
print("Square root of: " + str(num) + " is " + str(np.sqrt(num)))


Square root of: 36 is 6.0
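
The same dot notation works for the other "math stuff" mentioned earlier. As a minimal sketch (the array values here are made up for illustration), numpy functions apply to every element of an array at once, and matrices can be multiplied with the `@` operator:

```python
import numpy as np

# Made-up values, chosen only so the results are easy to check by hand
values = np.array([1.0, 4.0, 9.0, 16.0])

roots = np.sqrt(values)    # element-wise square root
squares = values ** 2      # element-wise exponent
logs = np.log(values)      # element-wise natural log

# A small 2x2 matrix multiplied by itself
matrix = np.array([[1, 2], [3, 4]])
product = matrix @ matrix

print(roots)
print(squares)
print(product)
```

Working on whole arrays like this is usually shorter than writing a loop over each number.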


## **Pandas** and DataFrames




Pandas takes the main data structures of Python  and organizes them into a format that makes data analysis very convenient - **DATAFRAMES**

A DataFrame is a 2-dimensional data structure with columns of potentially different types, along with column and row labels.  When you create one, you can optionally pass `index` (row labels) and `columns` (column labels) arguments along with the data.  (Technically, Pandas data frames are built from a list-like abstraction that Pandas calls a "Series" - for more details you can [look here](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html).)
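
As a minimal sketch of that description, the block below builds a tiny data frame by hand and passes both optional arguments. The values, the row labels `r1` to `r3`, and the name `toy_df` are made up for illustration:

```python
import pandas as pd

# Made-up toy data: each inner list is one row
rows = [
    [18.0, 8, "car a"],
    [31.5, 4, "car b"],
    [24.0, 6, "car c"],
]

toy_df = pd.DataFrame(
    rows,
    index=["r1", "r2", "r3"],                  # row labels
    columns=["mpg", "cylinders", "car_name"],  # column labels
)

print(toy_df)
print(toy_df.dtypes)           # each column keeps its own type

single_column = toy_df["mpg"]  # one column on its own is a Series
print(type(single_column))
```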

Pandas data frames can be constructed from most common data sources a data scientist will encounter: CSV files, Excel spreadsheets, SQL databases, JSON, URL pointers to other data sources, and even from other data already stored in one's Python code.


### Reading in Data

Data sets often contain different types of data, and may have names associated with the rows or columns.  Data frames are perfect for this type of data - often read in from a CSV or Excel spreadsheet.  

We can think of a data frame as a sequence of arrays of identical length; these are the columns. Entries in the different arrays can be combined to form a row.


You can read data in from a file online using the URL, or from a local file on your computer that has been downloaded.  
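
Here is a small sketch of reading CSV data. To keep it self-contained, the "file" is a made-up string wrapped in `io.StringIO`, which stands in for a local file; with a real file you would pass its path (or a URL) to `pd.read_csv` instead:

```python
import io
import pandas as pd

# Made-up file contents; in practice this text would live in a .csv file
csv_text = """mpg,cylinders,car_name
18.0,8,car a
31.5,4,car b
"""

# read_csv accepts a local path, a URL, or a file-like object such as this one
small_df = pd.read_csv(io.StringIO(csv_text))

print(small_df)
print(small_df.shape)   # (number of rows, number of columns)
```

By default the first line of the file is used for the column names.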

For more info on accessing data through APIs, see [Getting Data Through APIs](https://colab.research.google.com/drive/1jDBkbG8yEAaEAGEzNIQVa4LRbXeuqEWt)


Let's access a classic DS dataset, which has data on different car styles:

In [None]:
# This reads the data from a URL and sets the column names.
# The file is whitespace-delimited with no header row, so we split on runs of
# whitespace (sep=r"\s+") and supply the column names ourselves.
# Sometimes we will instead need to download data to our local machine and import it into Colab.
# Pandas can also read JSON or XML data if formatted properly.

url = "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data-original"
column_names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model', 'origin', 'car_name']
mpg_df = pd.read_csv(url, sep=r"\s+", header=None, names=column_names)

In [None]:
# First, just get a peek at the data:
mpg_df.head()

In [None]:
# Some general stats about the data
mpg_df.describe()

In [None]:
# How many missing values of each column?
mpg_df.isnull().sum()

In [None]:
# How many of each type of engine?
mpg_df["cylinders"].value_counts()

In [None]:
# Average horsepower per engine type
mpg_df.groupby("cylinders").horsepower.mean()
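
To see what a `groupby` followed by `mean()` does, here is a minimal sketch on made-up numbers (the frame `toy_df` is hypothetical and small enough to check by hand). The rows are split by the value in `cylinders`, and the mean of `horsepower` is computed within each group:

```python
import pandas as pd

# Made-up data: two 4-cylinder cars and three 8-cylinder cars
toy_df = pd.DataFrame({
    "cylinders": [4, 4, 8, 8, 8],
    "horsepower": [90.0, 110.0, 150.0, 200.0, 250.0],
})

# One mean per distinct value of "cylinders"
mean_hp = toy_df.groupby("cylinders")["horsepower"].mean()
print(mean_hp)
```

The result is a Series indexed by the group values (here 4 and 8).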

In [None]:
# Plot a histogram of mpg - (use matplotlib)
plt.hist(mpg_df.mpg, edgecolor='black')

In [None]:
# Or a scatter plot of acceleration vs mpg (with labels!)
plt.scatter( mpg_df.acceleration, mpg_df.mpg )
plt.xlabel("acceleration")
plt.ylabel("mpg")

In [None]:
# Fancy scatterplot (jointplot) from seaborn:
sns.jointplot(x="weight", y="mpg", data=mpg_df, kind="reg")

Pandas is widely used and has a very active development community contributing new features. If there is some kind of analysis you want to do on your data, chances are, it already exists. The [documentation for the pandas library](https://pandas.pydata.org/pandas-docs/stable/) is very good, and of course there is much help you can get from AI or Stack Overflow.

One important component of pandas is indexing and selecting components of the data. This is an extremely rich topic, so we'll only touch on it here. Please [consult the documentation](https://pandas.pydata.org/pandas-docs/stable/indexing.html) for more info.

In [None]:
# Columns can be selected using the `[]` operator, which accepts one column name or a list of several
# using two brackets [[ ]] ensures that the output is also a data frame

mpg_df[["cylinders", "car_name"]]

In [None]:
# pandas also allows selection using the `.column_name` notation
mpg_df.car_name

For selecting rows from the data there are two options:
- `.loc`: for selecting rows based on the _row label_
- `.iloc`: for selecting rows based on the _row number_

In the prior example, the row label and the row number are the same; often one wants to assign a label (a unique id) to each row. In many cases, this would be something like a date or a user id. Note: these two selectors can also be used to pick columns, but that's a bit less common.
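
The difference is easiest to see when labels and positions do not match. A minimal sketch, assuming a made-up frame whose row labels are the ids 101 to 103:

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"mpg": [18.0, 31.5, 24.0]},
    index=[101, 102, 103],   # made-up unique ids used as row labels
)

by_label = toy_df.loc[102]     # the row whose label is 102
by_position = toy_df.iloc[0]   # the first row, whatever its label is

# Both selectors accept a second argument that picks columns
cell = toy_df.loc[103, "mpg"]

print(by_label)
print(by_position)
print(cell)
```

Here `toy_df.iloc[102]` would raise an error, because the frame only has positions 0, 1 and 2.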

In [None]:
# Returns row #5 -- the 6th row.  NB: a single row comes back as a Series, so it is displayed vertically like a column!
mpg_df.iloc[5]

In [None]:
# Returns the first 6 rows
mpg_df.iloc[:6]

If we have _actual labels_ as an index for a dataframe, we can use `.loc` to select using values from that index.

In [None]:
#Let's set the row label (index) to be the car name.  Note: these are not unique.
car_index_df = mpg_df.set_index("car_name", inplace=False)
#Now, let's see what we have for a couple car names
car_index_df.loc[["amc rebel sst", "plymouth fury iii"]]

One can also select those rows that match a particular condition. Say I want to see only those rows that have an acceleration of less than 10 seconds:

In [None]:
mpg_df[mpg_df.acceleration < 10]  #some of the classic muscle cars
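
Under the hood, the condition produces a column of `True`/`False` values, and only the `True` rows are kept. The sketch below uses made-up numbers to show that intermediate step, and how two conditions might be combined with `|` (or) and `&` (and); each condition needs its own parentheses:

```python
import pandas as pd

# Made-up data for illustration
toy_df = pd.DataFrame({
    "acceleration": [9.5, 12.0, 8.5, 15.0],
    "cylinders": [8, 4, 8, 4],
})

mask = toy_df["acceleration"] < 10   # a Series of True/False, one per row
fast = toy_df[mask]                  # keeps only the rows where mask is True

# Rows that are very quick OR have 4 cylinders
fast_or_small = toy_df[(toy_df["acceleration"] < 9) | (toy_df["cylinders"] == 4)]

print(mask)
print(fast)
print(fast_or_small)
```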


What about operations on entire columns? This can make data munging much easier!  

When preparing data for predictive modeling we do "feature engineering" -- creating new variables that we believe/hope will be predictive of some target.  

Let's calculate a new variable which is weight divided by displacement, and add it to the mpg_df data frame:


In [None]:
mpg_df["WoverD"] = mpg_df["weight"] / mpg_df["displacement"]
mpg_df[["weight","displacement","WoverD"]]



In [None]:
# identify the cars with the top 3 and bottom 3 WoverD values

print(mpg_df.nlargest(3, "WoverD"))
print(mpg_df.nsmallest(3, "WoverD"))


Extra Questions:
- Find the average horsepower of the 5 cars with the lowest MPG.  Then do the same for the 5 cars with the highest MPG.
- Create a new feature that says whether a car's MPG is higher or lower than the median MPG.   Plot a boxplot of horsepower against your new binary variable.

