## I. Base Python Imports

### 1. Reading a text file

`filename = 'filename.txt'` <br>
`file = open(filename, mode='r')` <br>
`text = file.read()` <br>
`file.close()` <br>

In the above, the variable names are arbitrary; `'filename.txt'` stands for the path to any text file.

#### Reading a file in a context (no close required)

`with open(filename, mode='r') as file:` <br>
&emsp;&emsp; `print(file.read())` <br>

Here the `with` statement acts as a context manager: it binds the name `file` to the object returned by `open(filename, mode='r')` for the duration of the indented block, then closes the file automatically when the block exits. Using `with` is best practice because you never have to remember to close the file yourself.
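
A minimal sketch of this behaviour, assuming a throwaway file created with `tempfile` (the file name and contents are made up for illustration). It checks the file object's `closed` attribute inside and after the `with` block:

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained
# (path and contents are hypothetical).
path = os.path.join(tempfile.mkdtemp(), 'example.txt')
with open(path, mode='w') as file:
    file.write('Call me Ishmael.\n')

# Inside the block the file is open; on exit it is closed automatically.
with open(path, mode='r') as file:
    text = file.read()
    open_inside = not file.closed

# The name `file` still exists here, but the underlying file is closed.
closed_after = file.closed
print(text, open_inside, closed_after)
```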


#### Read and print individual lines

`with open('moby_dick.txt') as file:` <br>
&emsp;&emsp; `print(file.readline())` <br>
&emsp;&emsp; `print(file.readline())` <br>
&emsp;&emsp; `print(file.readline())` <br>

The above reads and prints the first three lines of the file, one line per `readline()` call.
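
As a rough illustration, the same idea can be written with a loop. The file below is a made-up stand-in for `moby_dick.txt`; the sketch also shows that iterating over the file object yields one line at a time:

```python
import os
import tempfile

# Hypothetical four-line file standing in for a real text file
path = os.path.join(tempfile.mkdtemp(), 'lines.txt')
with open(path, 'w') as file:
    file.write('line one\nline two\nline three\nline four\n')

# Read only the first three lines
first_three = []
with open(path) as file:
    for _ in range(3):
        # readline() keeps the trailing newline; rstrip removes it
        first_three.append(file.readline().rstrip('\n'))

# Iterating over the file object reads every line, one at a time
with open(path) as file:
    all_lines = [line.rstrip('\n') for line in file]

print(first_three)
print(len(all_lines))
```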

### 2. Pickle

`import pickle` <br>
`with open('filename.pkl', 'rb') as file:` <br>
&emsp;&emsp;`data = pickle.load(file)`

> `'rb'` stands for read-only, binary
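
A small round-trip sketch, assuming a made-up dictionary and a temporary file: the object is written with `'wb'` (write, binary) and read back with `'rb'`. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Hypothetical object to store
record = {'name': 'example', 'values': [1, 2, 3]}
path = os.path.join(tempfile.mkdtemp(), 'record.pkl')

# 'wb' = write, binary
with open(path, 'wb') as file:
    pickle.dump(record, file)

# 'rb' = read-only, binary
with open(path, 'rb') as file:
    data = pickle.load(file)

print(data)
```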

## II. Numpy Imports

#### 1. `np.loadtxt`
`import numpy as np` <br>
`filename = 'MNIST.txt'`<br>
`data = np.loadtxt(filename, delimiter=',')`<br>
`data`

**NB:** `np.loadtxt` struggles with mixed data types (e.g., numeric and string columns in the same file)

> there are other arguments such as:
> > `skiprows=1` <br>
> > `usecols=[0,2]` <br>
> > `dtype=str` (ensures all data is imported as string)
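
A short sketch of these arguments, using an in-memory `io.StringIO` buffer in place of a file on disk (the three-column data is made up):

```python
import io
import numpy as np

# Hypothetical comma-separated data with a header row
raw = "a,b,c\n1,2,3\n4,5,6\n"

# Skip the header and keep only the first and third columns
data = np.loadtxt(io.StringIO(raw), delimiter=',', skiprows=1, usecols=[0, 2])

# Import everything, header included, as strings
as_text = np.loadtxt(io.StringIO(raw), delimiter=',', dtype=str)

print(data)
print(as_text)
```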

#### 2. `np.genfromtxt`

`data = np.genfromtxt(filename, delimiter=',', names=True, dtype=None)`

`dtype=None` lets NumPy infer the type of each column. <br>
`names=True` tells it that the first row is a header.
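
A minimal sketch with mixed column types, again reading from an in-memory buffer (the data is invented, and `encoding='utf-8'` is passed so that string columns come back as `str`). The result is a structured array whose fields are accessed by header name:

```python
import io
import numpy as np

# Hypothetical data: one string column, one integer column, one float column
raw = "city,population,area\nOslo,700000,454.0\nBergen,285000,465.3\n"

data = np.genfromtxt(
    io.StringIO(raw),
    delimiter=',',
    names=True,        # first row holds the column names
    dtype=None,        # infer a type per column
    encoding='utf-8',
)

print(data.dtype.names)
print(data['population'])
```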

#### 3. `np.recfromcsv`

`data = np.recfromcsv(filename)` <br>

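> the above operates the same way as `genfromtxt` but with the defaults `delimiter=','`, `names=True`, and `dtype=None`. Note that `np.recfromcsv` was removed from the main namespace in NumPy 2.0; `np.genfromtxt` with these arguments is the replacement.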

## III. Pandas Imports

### 1. CSV Files
`import pandas as pd` <br>
`filename = 'file.csv'` <br>
`data = pd.read_csv(filename)` <br>
`data.head()`

#### Customizing csv import
`sep` = the pandas version of `delimiter` <br>
`comment` = character that marks the start of a comment; the rest of the line after it is ignored (e.g., '#') <br>
`na_values` = takes a list of strings to recognize as NA/NaN (e.g., 'Nothing')
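
A brief sketch of the three arguments together, reading made-up semicolon-separated text from an in-memory buffer:

```python
import io
import pandas as pd

# Hypothetical file contents: a comment line, a header, and three rows
raw = (
    "# exported for illustration\n"
    "name;score\n"
    "ann;10\n"
    "bob;Nothing\n"
    "cy;7#late entry\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep=';',                 # column separator
    comment='#',             # ignore everything after '#'
    na_values=['Nothing'],   # treat 'Nothing' as NaN
)

print(df)
```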

### 2. Excel spreadsheets
`file = 'filename.xlsx'` <br>
`data = pd.ExcelFile(file)` <br>
`data.sheet_names` = returns sheets in the excel file <br>
`data.parse('sheetname')` = returns sheet specified by sheetname (as string) <br>
`data.parse(0)` = returns sheet specified by index position (as integer)

#### Alternate excel import
`data = pd.read_excel(file, sheet_name=None)` <br>
`sheet_name=None` imports all sheets into a dict of DataFrames keyed by sheet name <br>
`data.keys()` returns the sheet names <br>
`data['sheetname']` returns the DataFrame for the specified sheet

#### Customizing excel import
`skiprows` = rows to skip, passed as a list of row indices (e.g., [0]) <br>
`names` = column names, passed as a list (e.g., ['Country'])  <br>
`usecols` = which columns to parse, passed as a list (e.g., [0])

### 3. SAS and Stata

- **SAS:** Statistical Analysis System
    - used in business analytics and biostatistics <br>
- **Stata:** "Statistics" + "data"
    - used in academic social science research
    
#### SAS files:
- **Used for:**
    - Advanced analytics
    - Multivariate analysis
    - Business intelligence
    - Data management
    - Predictive analytics
    - Standard for computational analysis
- **Most common extensions:**
    - `.sas7bdat` and `.sas7bcat`: dataset and catalog files, respectively.
    
#### Importing SAS files

`import pandas as pd` <br>
`from sas7bdat import SAS7BDAT` <br>
`with SAS7BDAT('filename.sas7bdat') as file:` <br>
&emsp;&emsp; `df_sas = file.to_data_frame()`

#### Importing Stata files
`import pandas as pd` <br>
`data = pd.read_stata('filename.dta')`
> <span style="color:indianred"> no context manager (i.e., `with`) required! </span>

### 4. HDF5 Files

"Hierarchical Data Format version 5"
- Standard for storing large quantities of numerical data


`import h5py` <br>
`filename = 'filename.hdf5'` <br>
`data = h5py.File(filename, 'r')` <br>

#### Exploring HDF5 files

`for key in data.keys():` <br>
&emsp;&emsp; `print(key)`

- _this returns hdf groups, which can be thought of as directories_

##### this continues down the structure:

`for key in data['groupname'].keys():` <br>
&emsp;&emsp; `print(key)`

- _this returns the names of the subgroups or datasets inside the group specified in place of 'groupname'_

##### to access content in the group:

`print(np.array(data['groupname']['subgroupname1']), np.array(data['groupname']['subgroupname2']))`

- _this converts the data to a numpy array and makes it accessible_

## IV. SciPy Imports

### 1. MATLAB Files

"Matrix Laboratory"

**read .mat files:** <br>
`import scipy.io` <br>
`filename = 'filename.mat'` <br>
`mat = scipy.io.loadmat(filename)` 

> <span style="color:royalblue"> `mat` is a `dict` </span>
> > <span style="color:royalblue"> `keys` = MATLAB variable names </span> <br>
> > <span style="color:royalblue"> `values` = objects assigned to those variables </span>




`scipy.io.savemat()` = write .mat files



## V. Creating a Database Engine with SQLAlchemy

`from sqlalchemy import create_engine` <br>
`engine = create_engine('sqlite:///Northwind.sqlite')` <br>
> `create_engine` starts an SQL engine that communicates our queries to the database. Its only required argument is a connection string that gives the type of database you're connecting to and the name of the database. <br>
<br>
> <span style="color:indianred"> note also: `sqlite:///` is required before the database name. </span>


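`table_names = engine.table_names()` <br>
`print(table_names)`
> <span style="color:indianred"> note: `engine.table_names()` was deprecated in SQLAlchemy 1.4 and removed in 2.0; on newer versions use `inspect(engine).get_table_names()` with `from sqlalchemy import inspect`. </span>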

### A. Workflow of SQL querying

1. import packages and functions (above, starting with `from sql...`)
2. create the database engine (above, starting with `engine =`)
3. connect to the engine
4. query the database
5. save the query results to a DataFrame
6. close the connection

#### 3. connect to engine

`con = engine.connect()`

#### 4. query database
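`rs = con.execute("SELECT * FROM Orders")`
> <span style="color:indianred"> note: in SQLAlchemy 2.0 and later, raw SQL strings must be wrapped in `text()`: `con.execute(text("SELECT * FROM Orders"))`, with `from sqlalchemy import text`. </span>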

#### 5. save query to DataFrame
`df = pd.DataFrame(rs.fetchall())` <br>
`df.columns = rs.keys()` 
> <span style="color:royalblue"> note: this ensures the df has the proper column names </span>

#### 6. close
`con.close()`
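
The steps above can be sketched end to end. To keep the example free of extra installs, this sketch swaps in the standard-library `sqlite3` module for SQLAlchemy and uses an in-memory database with a made-up `Orders` table; the connect / query / fetch / close sequence is the same.

```python
import sqlite3

# Steps 2-3: connect (an in-memory database stands in for Northwind.sqlite)
con = sqlite3.connect(':memory:')

# Hypothetical table and rows so there is something to query
con.execute("CREATE TABLE Orders (OrderID INTEGER, ShipName TEXT)")
con.executemany(
    "INSERT INTO Orders VALUES (?, ?)",
    [(1, 'Alpha'), (2, 'Beta'), (3, 'Gamma')],
)

# Step 4: query the database
cursor = con.execute("SELECT * FROM Orders")

# Step 5: collect the rows together with their column names
column_names = [col[0] for col in cursor.description]
rows = cursor.fetchall()
records = [dict(zip(column_names, row)) for row in rows]

# Step 6: close the connection
con.close()

print(column_names)
print(records)
```

With pandas available, `pd.DataFrame(rows, columns=column_names)` would turn the same result into a DataFrame.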

### B. Using Context Manager to Open Connection 
**This replaces steps 3-6 above**

`with engine.connect() as con:` <br>
&emsp;&emsp; `rs = con.execute("SELECT OrderID, OrderDate, ShipName FROM Orders")` <br>
&emsp;&emsp; `df = pd.DataFrame(rs.fetchmany(size=5))` <br>
&emsp;&emsp; `df.columns = rs.keys()`

> <span style="color:royalblue"> `fetchmany(size=5)` returns 5 rows from the result, while `fetchall()` returns all rows </span>

### C. Pandas to Query
**This also replaces steps 3-6 above**

`df = pd.read_sql_query("SELECT * FROM Orders", engine)`

#### Advanced Query with Pandas

`df = pd.read_sql_query("SELECT OrderID, CompanyName FROM Orders INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID", engine)`
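
A runnable sketch of the join, assuming two tiny made-up tables in an in-memory SQLite database. Here a standard-library `sqlite3` connection is passed to `pd.read_sql_query` in place of the SQLAlchemy engine:

```python
import sqlite3
import pandas as pd

# Hypothetical miniature versions of the Orders and Customers tables
con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE Customers (CustomerID TEXT, CompanyName TEXT);
    CREATE TABLE Orders (OrderID INTEGER, CustomerID TEXT);
    INSERT INTO Customers VALUES ('A1', 'Acme'), ('B2', 'Bolt');
    INSERT INTO Orders VALUES (10, 'A1'), (11, 'B2'), (12, 'A1');
""")

# Triple quotes allow the query to span several lines
query = """
    SELECT OrderID, CompanyName
    FROM Orders
    INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID
    ORDER BY OrderID
"""

df = pd.read_sql_query(query, con)
con.close()

print(df)
```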


## VI. Import Flat Files from Web

### 1. urllib package

- Provides interface for fetching data across the web <br>
> `urlopen()` accepts urls instead of filenames

#### save file locally & read into pd dataframe
`from urllib.request import urlretrieve` <br>
`url = 'webaddress'` <br>
`urlretrieve(url, 'csvname.csv')` <br>
`df = pd.read_csv('csvname.csv', sep=';')`

#### without saving locally
`import pandas as pd` <br>
`url = 'webaddress'` <br>
`df = pd.read_csv(url, sep=';')`

### 2. HTTP requests to import files from web
- protocol identifier: `http:` or `https:`
- resource name: e.g., `data.com`
- HTTP: "HyperText Transfer Protocol"
- Going to a website = sending an HTTP request
    - specifically, a GET request
- `urlretrieve()` performs a GET request
- HTML: "HyperText Markup Language"
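
To make the two parts of a URL concrete, here is a small sketch using `urllib.parse.urlparse` on a made-up address (no request is sent):

```python
from urllib.parse import urlparse

# Hypothetical address, used only to show how a URL splits apart
url = 'https://data.com/files/example.csv'
parts = urlparse(url)

print(parts.scheme)   # protocol identifier
print(parts.netloc)   # resource name (host)
print(parts.path)     # path to the file on that host
```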

#### a. GET request using urllib
`from urllib.request import urlopen, Request` <br>
`url = 'url'`  <br>
`request = Request(url)`  <br>
`response = urlopen(request)`  <br>
`html = response.read()`  <br>
`response.close()`

#### b. GET requests using requests
`import requests`  <br>
`url = "url"`  <br>
`r = requests.get(url)`  <br>
`text = r.text`

## VII. Web Scraping with Python

**HTML:** 
- Mix of unstructured and structured data
- structured: pre-defined data model **or** organized in a definitive manner
- unstructured: neither of these properties

### BeautifulSoup

`from bs4 import BeautifulSoup` <br>
`import requests`<br>
`url = "url"`<br>
`r = requests.get(url)`<br>
`html_doc = r.text`<br>
`soup = BeautifulSoup(html_doc, 'html.parser')`<br>
<br>
#### beautiful soup methods
`soup.prettify()`
> <span style= "color:royalblue"> prettified soup is properly indented, and thus much clearer </span>

`soup.title` extracts title<br>
`soup.get_text()` extracts all the text in the document (including the title) <br> 
<br> 
`for link in soup.find_all('a'):` <br>
&emsp;&emsp; `print(link.get('href'))` extracts all the hyperlinks
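
For comparison, here is a dependency-free sketch of what that loop does, written with the standard library's `html.parser` instead of BeautifulSoup. The `LinkCollector` class and the HTML string are made up for illustration; this is not how BeautifulSoup is implemented.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)


# Made-up page with two links and one anchor without an href
html_doc = (
    '<html><head><title>Demo</title></head><body>'
    '<a href="https://example.com/a">A</a>'
    '<a href="/b">B</a>'
    '<a>no href</a>'
    '</body></html>'
)

collector = LinkCollector()
collector.feed(html_doc)
links = collector.links
print(links)
```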

## VIII. APIs & JSONs

API = "Application Programming Interface"

#### JSONs
JSON = "JavaScript Object Notation"
- real-time server-to-browser communication
- human readable
- natural to store JSONs as a python **dict** due to key-value pairs
- all keys are strings in JSONs, values can be otherwise

### 1. Loading JSONs in Python
`import json` <br>
`with open('name.json', 'r') as json_file:` <br>
&emsp;&emsp; `json_data = json.load(json_file)`

> <span style= "color:royalblue"> note: `json_data` will be a `dict` here (when the top-level JSON value is an object) </span>

**to print the key-value pairs to the console, iterate as if dict:** <br>
`for key, value in json_data.items():`<br>
&emsp;&emsp; `print(key + ':', value)`
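
A small sketch with a made-up JSON string, parsed with `json.loads` (the string counterpart of `json.load`). Values need not be strings, so passing `value` to `print` as a separate argument avoids the `TypeError` that string concatenation would raise:

```python
import json

# Hypothetical JSON text: a string, a number, and a list as values
raw = '{"title": "Example", "year": 2000, "tags": ["a", "b"]}'
json_data = json.loads(raw)

# Keys are always strings; values keep their own types
for key, value in json_data.items():
    print(key + ':', value)

# Concatenating a str with a non-str value fails
try:
    'year' + ':' + json_data['year']
    concat_failed = False
except TypeError:
    concat_failed = True

print(concat_failed)
```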

### 2. APIs and interacting with the web

An API is a set of protocols and routines for building and interacting with software applications.
- Code that allows two software programs to interact with each other

#### code:
`import requests` <br>
`url = "apiurl"` <br>
`r = requests.get(url)` <br>
`json_data = r.json()` <br>
<br>
`for key, value in json_data.items():` <br>
&emsp;&emsp; `print(key + ':', value)`

#### example URL:
`http://www.omdbapi.com/?t=hackers`
- **http** = making an HTTP request
- **www.omdbapi.com** = querying the OMDb API
- **?t=hackers** = the query string
     - **t** = the query parameter
     - **hackers** = the value we are passing to the API
     (i.e., "return data for a movie with the title 'Hackers'")
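
As a rough sketch, the query string can be built and taken apart with `urllib.parse` rather than typed by hand. No request is sent here; only the URL text is handled.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base = 'http://www.omdbapi.com/'

# Build the query string from a dict of parameters
url = base + '?' + urlencode({'t': 'hackers'})
print(url)

# Split an existing URL back into its query parameters
query = parse_qs(urlsplit(url).query)
print(query)
```

Each value comes back as a list because a parameter may appear more than once in a query string.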
     
### 3. Twitter API

- Account required

`import tweepy, json` <br>
`access_token = "..."` <br>
`access_token_secret = "..."` <br>
`consumer_key = "..."` <br>
`consumer_secret = "..."` <br>

#### create a streaming object

`stream = tweepy.Stream(consumer_key, consumer_secret, access_token, access_token_secret)`

#### filter twitter streams to capture data by keywords

`stream.filter(track=['apples', 'oranges'])`

#### example twitter data load:

```python
# Import package
import json

# String of path to file: tweets_data_path
tweets_data_path = 'tweets.txt'

# Initialize empty list to store tweets: tweets_data
tweets_data = []

# Open the file, read in tweets, and store them in the list: tweets_data
# (the with statement closes the file automatically)
with open(tweets_data_path, "r") as tweets_file:
    for line in tweets_file:
        tweet = json.loads(line)
        tweets_data.append(tweet)

# Print the keys of the first tweet dict
print(tweets_data[0].keys())
```

#### turning twitter data into DataFrame

```python
# Import package
import pandas as pd

# Build DataFrame of tweet texts and languages
df = pd.DataFrame(tweets_data, columns=['text', 'lang'])

# Print head of DataFrame
print(df.head())
```

#### Twitter text analysis

**function created for below analysis:**
```python
import re

def word_in_text(word, text):
    """Return True if `word` (used as a regex pattern) occurs in `text`, ignoring case."""
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)

    return bool(match)
```
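
One thing to keep in mind: because `word` is handed to `re.search`, it is interpreted as a regular expression, so characters such as `.` act as wildcards. The sketch below contrasts that with a hypothetical variant, `word_in_text_literal`, which escapes the word first; the sample strings are made up.

```python
import re


def word_in_text(word, text):
    # Same logic as the function above: `word` is used as a regex pattern
    return re.search(word.lower(), text.lower()) is not None


def word_in_text_literal(word, text):
    # Hypothetical variant: escape `word` so it is matched character for character
    return re.search(re.escape(word), text, flags=re.IGNORECASE) is not None


pattern_hit = word_in_text('mr.', 'Mrs Smith')          # '.' matches the 's'
literal_hit = word_in_text_literal('mr.', 'Mrs Smith')  # no literal 'mr.' present
literal_dot = word_in_text_literal('mr.', 'Mr. Smith')  # literal 'Mr.' present

print(pattern_hit, literal_hit, literal_dot)
```

For plain alphabetic keywords such as candidate names the two versions behave the same.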

**code:**
```python

# Initialize a counter for each candidate
clinton, trump, sanders, cruz = 0, 0, 0, 0

# Iterate through df, counting the number of tweets in which
# each candidate is mentioned
for index, row in df.iterrows():
    clinton += word_in_text('clinton', row['text'])
    trump += word_in_text('trump', row['text'])
    sanders += word_in_text('sanders', row['text'])
    cruz += word_in_text('cruz', row['text'])
    
```
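
An alternative sketch of the same count using the pandas string method `str.contains` instead of `iterrows`. The three-row DataFrame is made up for illustration, and `counts` is a name introduced here:

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame of tweets
df = pd.DataFrame({
    'text': [
        'Clinton speaks tonight',
        'trump and CLINTON debate',
        'nothing relevant here',
    ]
})

# For each name, count the rows whose text contains it (case-insensitive)
names = ['clinton', 'trump', 'sanders', 'cruz']
counts = {
    name: int(df['text'].str.contains(name, case=False).sum())
    for name in names
}

print(counts)
```

Like `word_in_text`, `str.contains` treats the search term as a regular expression by default.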
