# Python Libraries

Previously, we learned how to write our own functions that could be called multiple times. It is possible to call functions that other people have written as well! We do this by importing files that define the various functions we could use. Two of the most important libraries (collections of files with predefined functions) are **numpy** and **pandas**.

**numpy**, short for **num**erical **py**thon, specializes in arrays and is heavily used for its RANDOM functions. 

In [None]:
import numpy

# randint(VALUE) generates a random integer between 0 and VALUE (not including VALUE)
x = numpy.random.randint(100)
print(x)

In [None]:
# Alternatively, we can import JUST the random functions from numpy and ignore the rest of the library
# When we do this, we don't need to include the library name
from numpy import random

x = random.randint(100)
print(x)
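
randint can also take a lower bound and a size. Here is a minimal sketch of both; the variable names die_roll and five_rolls are made up for this illustration.

```python
import numpy

# randint(LOW, HIGH) draws from LOW up to, but not including, HIGH
die_roll = numpy.random.randint(1, 7)
print(die_roll)

# size=5 returns a numpy array of five random integers instead of a single value
five_rolls = numpy.random.randint(1, 7, size=5)
print(five_rolls)
```

Since the values are random, the printed numbers will change every time you run the cell.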

**pandas** is short for **pan**el **da**ta and is primarily used for data analysis. We will rely heavily on pandas to manipulate data. pandas is built on top of numpy, so you will commonly see the two imported together, even though pandas does not need you to import numpy yourself.

In [None]:
import numpy
import pandas

In [None]:
# Alternatively, you can cut down on your typing by providing an alias for the library names
# pandas' own documentation uses the aliases below.
# See https://pandas.pydata.org/docs/user_guide/10min.html for an example

import numpy as np
import pandas as pd

# Importing Files

A **file type** tells you how data is saved. Different file types specialize in saving data in certain ways. A great example is docx vs xlsx files. docx files are excellent at displaying large portions of text and images in a horizontal or vertical format. xlsx files are excellent at saving large portions of data in neat rows and columns so that you can perform functions on specific portions of the data. 

For Python to read a particular file type, you will need specialized functions that extract the information in ways we understand. This is a great example of where libraries can be incredibly useful! Here is a list of file types and the associated library you would want to install to work with each one. Run "pip install ____" in a terminal or command prompt to install a library.

- **.docx**: python-docx
- **.csv**: pandas
- **.xlsx**: pandas (pandas reads xlsx files through openpyxl, so install that as well)
- **.pdf**: pymupdf

In our course, we will focus solely on csv and xlsx files as these are the most common Data Science file types you will need to manipulate.
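
Reading a csv works much like reading an xlsx file, using pandas.read_csv. Here is a minimal sketch. To keep it self-contained, the csv text below is invented and wrapped in io.StringIO so pandas can read it like a file.

```python
import io
import pandas

# Stand-in for a real file: made-up csv text that pandas can read like a file
csv_text = "name,score\nAda,90\nGrace,85\n"
csv_dataframe = pandas.read_csv(io.StringIO(csv_text))

print(csv_dataframe)
```

With a real file, you would pass the file name instead, for example pandas.read_csv("my_file.csv"), where my_file.csv is a hypothetical file saved in the same folder as your notebook.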

# Import xlsx using pandas

Let's get some government data to learn how to use pandas. 

1. From https://www.bls.gov/oes/tables.htm, download the most recent "All data" XLSX file. This will download a zip file to your Downloads folder. 
2. Unzip the folder. 
3. Move the file named "all_data_M_20XX.xlsx" (XX will be the last two digits of the year you chose) to the folder where you run your jupyter notebooks. 

In [None]:
import numpy
import pandas

# For this example, I downloaded May 2022 data
# This will take 1-2 minutes as there are 400,000+ rows of data!
dataframe = pandas.read_excel("all_data_M_2022.xlsx")

In [None]:
# .head() prints the first 5 rows of the dataframe
# Very useful for reading the column names and seeing the first few rows
dataframe.head()

### Some notes:
**columns** are bolded across the top. Be sure to find descriptions of each column name when you download databases. For example, you can find the description of all OCC_CODE and OCC_TITLE here: https://www.bls.gov/oes/current/oes_stru.htm

**NaN** means Not a Number. This commonly happens when the cell is left blank in an xlsx document. 

The size of the dataframe is in the bottom-left corner. Since we only printed the head, it is showing 5 rows. Try removing the .head() and see how many rows the actual dataframe has.
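
You can also ask the dataframe for its size and its blank cells directly. Here is a minimal sketch on a small made-up dataframe, so it runs without the download. OCC_TITLE is a real column name from the data, but EXAMPLE_WAGE and all of the values are invented for illustration.

```python
import numpy
import pandas

# Small made-up dataframe standing in for the downloaded data
small_df = pandas.DataFrame({
    'OCC_TITLE': ['Example Job A', 'Example Job B', 'Example Job C'],
    'EXAMPLE_WAGE': [20.5, numpy.nan, 31.0],
})

# .shape is (number of rows, number of columns)
print(small_df.shape)

# .isna().sum() counts the NaN cells in each column
print(small_df.isna().sum())
```

The same two calls should work on the full dataframe loaded above.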

## Venn Diagram of Merging

StackOverflow has an excellent discussion of different types of merging WITH VISUALS: https://stackoverflow.com/questions/38549/what-is-the-difference-between-inner-join-and-outer-join

Key Visual Summary: https://i.stack.imgur.com/hMKKt.jpg
- **Concatenate** - Return all rows with NaN for missing data
- **Inner Merge** - Return rows with column matches in both dataframes
- **Outer Merge** - Return rows with column matches in either dataframe

## Concatenate

The idea of concatenating is to take the two dataframes and stack them on one another. Any missing data (columns defined in one dataframe and not in the other) is treated as NaN. **There is no attempt to match and exclude data.**

In [None]:
df_1 = pandas.DataFrame({'col_1': [1, 2], 'col_2': [3, 4]})
df_2 = pandas.DataFrame({'col_1': [11, 12], 'col_3': [13, 14]})
concat_df = pandas.concat([df_1, df_2])

concat_df

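Notice that some values were changed from integers to floats. This happens because NaN is a floating-point value, so any column that contains NaN is stored as floats. We will deal with that in a future lesson.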
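
To see exactly which columns changed, we can check the type of each column. Here is a small sketch using the same dataframes. The ignore_index=True argument is an optional extra that renumbers the rows of the result.

```python
import pandas

df_1 = pandas.DataFrame({'col_1': [1, 2], 'col_2': [3, 4]})
df_2 = pandas.DataFrame({'col_1': [11, 12], 'col_3': [13, 14]})

# ignore_index=True renumbers the rows 0, 1, 2, 3 instead of repeating 0, 1, 0, 1
concat_df = pandas.concat([df_1, df_2], ignore_index=True)

# col_1 has no gaps, so it stays an integer column
# col_2 and col_3 each contain NaN, so pandas stores them as floats
print(concat_df.dtypes)
```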

## Inner Merge

In [None]:
df_1 = pandas.DataFrame({'col_1': [1, 2], 'col_2': [3, 4], 'col_3': [13, 14]})
df_2 = pandas.DataFrame({'col_1': [1, 2], 'col_3': [3, 24]})
inner_merge_df = pandas.merge(df_1, df_2, how="inner")

inner_merge_df

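By default, merge compares every column the two dataframes share (here, col_1 and col_3). Since no pair of rows matches on both of those columns, the output is empty. But what if we wanted to check for partial matches? We can define the columns we want to merge on with **on=[]**.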

In [None]:
df_1 = pandas.DataFrame({'col_1': [1, 2], 'col_2': [3, 4], 'col_3': [13, 14]})
df_2 = pandas.DataFrame({'col_1': [1, 12], 'col_2': [100, 200], 'col_3': [13, 14]})
inner_merge_df = pandas.merge(df_1, df_2, how="inner", on=['col_1', 'col_3'])
# Since the first row has col_1 = 1 and col_3 = 13 in both dataframes, those rows are merged

inner_merge_df

This is a useful trick when looking for partial matches. The first rows of df_1 and df_2 agree on col_1 and col_3, so the inner merge combines them into one row. The values that differ between df_1 and df_2 are saved as col_2_x (from df_1) and col_2_y (from df_2).
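
If the _x and _y endings are hard to remember, the suffixes argument of merge lets you pick your own. Here is a small sketch. The endings _df1 and _df2 are just names chosen for this example.

```python
import pandas

df_1 = pandas.DataFrame({'col_1': [1, 2], 'col_2': [3, 4], 'col_3': [13, 14]})
df_2 = pandas.DataFrame({'col_1': [1, 12], 'col_2': [100, 200], 'col_3': [13, 14]})

# suffixes replaces the default _x and _y endings on the conflicting columns
inner_merge_df = pandas.merge(
    df_1, df_2, how="inner", on=['col_1', 'col_3'], suffixes=('_df1', '_df2')
)

print(inner_merge_df)
```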

## Outer Merge

In [None]:
df_1 = pandas.DataFrame({'col_1': [1, 2], 'col_2': [3, 4], 'col_3': [13, 14]})
df_2 = pandas.DataFrame({'col_1': [1, 12], 'col_2': [100, 200], 'col_3': [13, 14]})
outer_merge_df = pandas.merge(df_1, df_2, how="outer", on=['col_1'])

outer_merge_df

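Notice that we defined an outer merge on col_1. This means the rows with col_1 = 1 were combined, with the values of the other columns from each dataframe listed side by side (the _x columns come from df_1 and the _y columns from df_2). The other two rows are included because an outer merge keeps unmatched rows from either dataframe, filling the columns from the other dataframe with NaN.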
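
If it is hard to tell which dataframe each row came from, merge can label the rows for us. Here is a small sketch of the same outer merge with indicator=True, which adds a column named _merge.

```python
import pandas

df_1 = pandas.DataFrame({'col_1': [1, 2], 'col_2': [3, 4], 'col_3': [13, 14]})
df_2 = pandas.DataFrame({'col_1': [1, 12], 'col_2': [100, 200], 'col_3': [13, 14]})

# indicator=True adds a _merge column: both, left_only (df_1 only), or right_only (df_2 only)
outer_merge_df = pandas.merge(df_1, df_2, how="outer", on=['col_1'], indicator=True)

# Show just the matching column and the label to keep the output short
print(outer_merge_df[['col_1', '_merge']])
```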

In [None]:
df_1 = pandas.DataFrame({'col_1': [1, 2], 'col_2': [3, 4], 'col_3': [13, 14]})
df_2 = pandas.DataFrame({'col_1': [1, 12], 'col_3': [13, 14]})
outer_merge_df = pandas.merge(df_1, df_2, how="outer")

outer_merge_df

Notice how the first row of df_2 does not appear as a separate row. It matches the first row of df_1 on every column they share (col_1 and col_3), so the two rows are combined into one.

In [None]:
df_1 = pandas.DataFrame({'col_1': [1, 2], 'col_2': [3, 4]})
df_2 = pandas.DataFrame({'col_1': [11, 12], 'col_3': [13, 14]})
outer_merge_df = pandas.merge(df_1, df_2, how="outer")

outer_merge_df

When there are no matching rows, an outer merge will look like a concatenate. **The main difference here is outer merge TRIES to combine matching rows while concatenate does not!**
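
To see that difference side by side, here is a small sketch with two made-up dataframes that share one identical row.

```python
import pandas

df_1 = pandas.DataFrame({'col_1': [1, 2], 'col_2': [3, 4]})
df_2 = pandas.DataFrame({'col_1': [1, 12], 'col_2': [3, 14]})

# Concatenate stacks everything, so the shared row (col_1 = 1, col_2 = 3) appears twice
concat_df = pandas.concat([df_1, df_2], ignore_index=True)
print(len(concat_df))

# Outer merge recognizes the shared row and keeps a single copy of it
outer_merge_df = pandas.merge(df_1, df_2, how="outer")
print(len(outer_merge_df))
```

Concatenate ends up with one more row than the outer merge because it keeps both copies of the shared row.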

## Note

If you looked at the key visual summary, you'll see there are more ways to combine data than concatenate, inner merge, and outer merge. However, these will be the three most common ways to merge that you will use in the course.