# Importing data
- Flat files, e.g. .txt, .csv.
- Files from other software (Excel spreadsheets, Stata, SAS and MATLAB files).
- Relational databases.

## Reading a text file
```
filename = 'huck_finn.txt'
file = open(filename, mode='r') # 'r' is to read
text = file.read()
file.close()

# then can print to check it
print(text)
```

## Writing to a file
```
filename = 'huck_finn.txt'
# 'w' is to write; note that it creates the file or erases its existing contents
file = open(filename, mode='w')
file.close()
```

## Using a context manager to close the file automatically
```
with open('huck_finn.txt', 'r') as file:
  print(file.read())
```
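A minimal sketch of what the `with` statement does here, assuming a throwaway temporary file (the file name and contents are made up for illustration). It checks the file object's `closed` attribute inside and after the block:

```python
import os
import tempfile

# Create a small throwaway text file (hypothetical content, for illustration only).
path = os.path.join(tempfile.mkdtemp(), 'example.txt')
with open(path, 'w') as file:
    file.write('line one\nline two\n')

# Inside the block the file is open; once the block exits it is closed,
# even though file.close() is never called explicitly.
with open(path, 'r') as file:
    text = file.read()
    open_inside = not file.closed

closed_after = file.closed

print(text)
print(open_inside, closed_after)
```

The same guarantee holds if the body of the block raises an exception, which is the main reason to prefer `with` over a manual `open()` / `close()` pair.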

## Practice
```
# Read & print the first 3 lines
with open('moby_dick.txt') as file:
  print(file.readline())
  print(file.readline())
  print(file.readline())
```
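Repeating `readline()` works for three lines but does not scale. One possible alternative, sketched here on a made-up temporary file, relies on the fact that a file object iterates over its lines, so `itertools.islice` can stop after the first `n`:

```python
import os
import tempfile
from itertools import islice

# Hypothetical four-line file standing in for a real text such as moby_dick.txt.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w') as file:
    file.write('first\nsecond\nthird\nfourth\n')

# A file object is an iterator over its lines; islice takes only the first 3.
with open(path) as file:
    first_three = [line.rstrip('\n') for line in islice(file, 3)]

print(first_three)
```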

## Flat files
- Text files containing records.
- That is, table data.
- Record: a row of fields or attributes.
- Column: a feature or attribute (identified by a column name).

*File extension*
- .csv - comma separated values.
- .txt - text file.
- Values in flat files can be separated by commas or tabs - delimiters.
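To make the terms above concrete, here is a small sketch using the standard-library `csv` module on an in-memory, tab-delimited flat file. The data and column names are invented for illustration:

```python
import csv
import io

# Hypothetical tab-delimited flat file held in memory: a header row, then records.
raw = 'name\tage\nAda\t36\nLinus\t28\n'

reader = csv.reader(io.StringIO(raw), delimiter='\t')
header = next(reader)      # the column names
records = list(reader)     # each record is a list of field values (as strings)

print(header)
print(records)
```

Changing `delimiter` to `','` is all it takes to read a comma-separated file the same way.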

# Importing flat files using NumPy

## Why NumPy?
- NumPy arrays: the standard for storing numerical data (fast and clean).
- Essential for other packages, e.g. scikit-learn (a machine learning package for Python).
- Has a number of built-in functions that make it easier and more efficient to import data as arrays:
  - loadtxt()
  - genfromtxt()

## Importing flat files using Numpy
```
import numpy as np
filename = 'MNIST.txt'
data = np.loadtxt(filename, delimiter=',')
data
```

## Customizing Numpy import
```
import numpy as np
filename = 'MNIST_head.txt'
data = np.loadtxt(filename, delimiter=',', skiprows=1)
print(data)
```

```
# Import numpy
import numpy as np

# Assign the filename: file
file = 'digits_header.txt'

# Load the data: data
# Skip the first skiprows lines, including comments; default: 0.
# usecols takes a list of the indices of the columns you wish to keep.
data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0, 2])

# Print data
print(data)
```
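The effect of `skiprows` and `usecols` is easier to see on a tiny input. This sketch feeds `np.loadtxt` an in-memory string instead of a file; the three-column data is made up for illustration:

```python
import io
import numpy as np

# Hypothetical tab-delimited data: one header row, then two rows of three numbers.
raw = 'a\tb\tc\n1\t2\t3\n4\t5\t6\n'

# skiprows=1 drops the header; usecols=[0, 2] keeps only the first and third columns.
data = np.loadtxt(io.StringIO(raw), delimiter='\t', skiprows=1, usecols=[0, 2])

print(data)
print(data.shape, data.dtype)
```

Without `skiprows=1`, the header strings could not be converted to floats and `loadtxt` would raise an error.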

## Importing .csv files using Numpy
```
# Import 'titanic.csv' using the function np.genfromtxt()

data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)
```
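As a rough illustration of what `names=True` and `dtype=None` do, the sketch below runs `np.genfromtxt` on a small in-memory CSV with mixed column types. The data is invented, and `encoding='utf-8'` is passed explicitly so that text columns come back as strings:

```python
import io
import numpy as np

# Hypothetical CSV with a text column, an integer column and a float column.
raw = 'name,age,fare\nAda,36,7.25\nLinus,28,71.28\n'

# names=True reads the header as field names; dtype=None infers a type per column.
data = np.genfromtxt(io.StringIO(raw), delimiter=',', names=True,
                     dtype=None, encoding='utf-8')

print(data.dtype.names)   # field names taken from the header
print(data['age'])        # columns are accessed by name
```

The result is a one-dimensional structured array (one entry per row), not a two-dimensional numeric array.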

```
# Assign the filename: file
file = 'titanic.csv'

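# Import file using np.recfromcsv: d
# (recfromcsv was removed in NumPy 2.0; there, use
#  np.genfromtxt(file, delimiter=',', names=True, dtype=None) instead)
d = np.recfromcsv(file, delimiter=',', dtype=None)

# Print out first three entries of d
print(d[:3])
```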

## Importing .csv files using Pandas

## Manipulating pandas DataFrames
- Exploratory data analysis.
- Data wrangling.
- Data preprocessing.
- Building models.
- Visualization.
- Using pandas for these tasks is standard and best practice.

```
# Import pandas as pd
import pandas as pd

# Assign the filename: file
file = 'titanic.csv'

# Read the file into a DataFrame: df
df = pd.read_csv(file)

# View the head of the DataFrame
print(df.head())
```

```
# Assign the filename: file
file = 'digits.csv'

# Read the first 5 rows of the file into a DataFrame: data
data = pd.read_csv(file, nrows=5, header=None)

# Build a numpy array from the DataFrame: data_array
data_array = np.array(data)

# Print the datatype of data_array to the shell
print(type(data_array))
```

```
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# Assign filename: file
file = 'titanic_corrupt.txt'

# Import file: data
data = pd.read_csv(file, sep='\t', comment='#', na_values='Nothing')

# Print the head of the DataFrame
print(data.head())

# Plot 'Age' variable in a histogram
data[['Age']].hist()
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()

```
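The `sep`, `comment` and `na_values` arguments used above can be tried out on a small in-memory example. This is only a sketch with made-up data, not the contents of `titanic_corrupt.txt`:

```python
import io
import pandas as pd

# Hypothetical "corrupt" file: a comment line, tab-separated values,
# and the string 'Nothing' standing in for a missing value.
raw = '# a comment line\nName\tAge\nAda\t36\nLinus\tNothing\n'

df = pd.read_csv(io.StringIO(raw), sep='\t', comment='#', na_values='Nothing')

print(df)
print(df['Age'].isna().tolist())
```

The comment line is skipped before the header is read, and `'Nothing'` is parsed as `NaN`, so the `Age` column stays numeric.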

# Introduction to other file types.
- Excel spreadsheet.
- MATLAB files.
- SAS.
- Stata files.
- HDF5 files.

## Pickled files
- A file type native to Python.
- Motivation: for many data types it isn't obvious how to store them in a file.
- Pickled files are serialized.
- Serialize = convert an object to a bytestream.
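A short sketch of the serialize / deserialize round trip, using `pickle.dumps` and `pickle.loads` on a made-up nested object that has no obvious flat-file layout:

```python
import pickle

# Hypothetical object mixing a dict, a list and a tuple.
record = {'name': 'Ada', 'scores': [90, 85], 'tags': ('a', 'b')}

payload = pickle.dumps(record)     # serialize: object -> bytes
restored = pickle.loads(payload)   # deserialize: bytes -> object

print(type(payload))
print(restored == record)
```

Unpickling can execute arbitrary code, so only load pickle files that come from a source you trust.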

```
# Import pickle package
import pickle

# Open pickle file and load data: d
with open('data.pkl', 'rb') as file:
    d = pickle.load(file)

# Print d
print(d)

# Print datatype of d
print(type(d))

## output
# {'June': '69.4', 'Aug': '85', 'Airline': '8', 'Mar': '84.4'}
# <class 'dict'>
```

## Excel spreadsheet
```
# Import pandas
import pandas as pd

# Assign spreadsheet filename: file
file = 'battledeath.xlsx'

# Load spreadsheet: xls
xls = pd.ExcelFile(file)

# Print sheet names
print(xls.sheet_names)

## output
# ['2002', '2004']
```

```
# Load a sheet into a DataFrame by name: df1
df1 = xls.parse('2004')

# Print the head of the DataFrame df1
print(df1.head())

# Load a sheet into a DataFrame by index: df2 (Load first sheet using index)
df2 = xls.parse(0)

# Print the head of the DataFrame df2
print(df2.head())
```

```
# Parse the first sheet and rename the columns: df1
df1 = xls.parse(0, skiprows=[0], names=['Country', 'AAM due to War (2002)'])

# Print the head of the DataFrame df1
print(df1.head())

# Parse the first column of the second sheet and rename the column: df2
df2 = xls.parse(1, usecols=[0], skiprows=[0], names=['Country'])

# Print the head of the DataFrame df2
print(df2.head())
```


# Importing SAS/Stata files using pandas

## SAS and Stata files
- SAS: Statistical Analysis System.
- Stata: "Statistics" + "data".
- SAS: Used a great deal in business analytics and biostatistics.
- Stata: Popular in academic social science research.

*SAS files*
- Used for:
  - Advanced analytics.
  - Multivariate analysis.
  - Business intelligence.
  - Data management.
  - Predictive analytics.
  - Standard for computational analysis.

```
# Importing SAS files
import pandas as pd
from sas7bdat import SAS7BDAT
with SAS7BDAT('urbanpop.sas7bdat') as file:
  df_sas = file.to_data_frame()
```

```
# Importing Stata files
import pandas as pd
data = pd.read_stata('urbanpop.dta')
```

# Importing HDF5 files

## HDF5
- Hierarchical Data Format version 5
- Standard for storing large quantities of numerical data.
- Datasets can be hundreds of gigabytes or terabytes.
- HDF5 can scale to exabytes.

```
# Importing HDF5 files
import h5py
import numpy as np

filename = 'H-H1_LOSC_4_V1_815411200-4096.hdf5'
data = h5py.File(filename, 'r') # 'r' is to read

for key in data.keys():
  print(key) # There are 3 keys: meta, quality, strain

# List the datasets stored in the 'meta' group
for key in data['meta'].keys():
  print(key)

# Read two of the metadata values
print(np.array(data['meta']['Description']), np.array(data['meta']['Detector']))
```

# Importing MATLAB files

## MATLAB
- "Matrix Laboratory"
- Industry standard in engineering and science.
- Data is saved as .mat files.

## SciPy to the rescue
- scipy.io.loadmat() - read .mat files.
- scipy.io.savemat() - write .mat files.
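A hedged sketch of the two functions working together: it writes a made-up variable `x` to an in-memory buffer with `savemat` and reads it back with `loadmat`, so no `.mat` file on disk is needed. Both functions also accept a file name:

```python
import io
import numpy as np
import scipy.io

# Write a hypothetical MATLAB variable 'x' to an in-memory .mat file.
buffer = io.BytesIO()
scipy.io.savemat(buffer, {'x': np.array([1.0, 2.0, 3.0])})

# Read it back; the result is a dict keyed by variable name.
buffer.seek(0)
mat = scipy.io.loadmat(buffer)

print(type(mat))
print(mat['x'].shape)
```

Note that the one-dimensional array comes back two-dimensional, because MATLAB arrays always have at least two dimensions.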

```
# Importing .mat files
import scipy.io
filename = "workspace.mat"
mat = scipy.io.loadmat(filename)
print(type(mat))
## output
# <class 'dict'>
```

- keys = MATLAB variable names.
- values = objects assigned to variables.

```
print(type(mat['x']))
## output
# <class 'numpy.ndarray'>
```

# Introduction to relational databases
## The relational model
- Each row or record in a table represents an instance of an entity type.
- Each column in a table represents an attribute or feature of an instance.
- Every table contains a primary key column, which has a unique entry for each row.
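The primary-key rule can be seen in action with the standard-library `sqlite3` module and an in-memory database. The `Artist` table and its rows below are a simplified, hypothetical stand-in, not the actual Chinook data:

```python
import sqlite3

# In-memory database with one table; ArtistId is the primary key.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT)')
con.execute("INSERT INTO Artist VALUES (1, 'Artist A')")
con.execute("INSERT INTO Artist VALUES (2, 'Artist B')")

# Re-using an existing primary key value is rejected by the database.
try:
    con.execute("INSERT INTO Artist VALUES (1, 'Duplicate')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

rows = con.execute('SELECT ArtistId, Name FROM Artist ORDER BY ArtistId').fetchall()
con.close()

print(duplicate_rejected)
print(rows)
```

Each tuple in `rows` is one record (an instance of the entity), and each position in the tuple is one column (an attribute).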

## Creating a database engine in Python
SQLite is used here because it is fast and simple.

SQLAlchemy is the library used to connect to the SQLite database.
```
from sqlalchemy import create_engine, inspect

engine = create_engine('sqlite:///Chinook.sqlite')
# Save the table names to a list: table_names
# (engine.table_names() was removed in SQLAlchemy 2.0; the inspector works across versions)
table_names = inspect(engine).get_table_names()

# Print the table names to the shell
print(table_names)
```
The engine returned by `create_engine` is what carries our queries to the database.

## Workflow of SQL querying
1. Import packages and functions.
2. Create the database engine.
3. Connect to the engine.
4. Query the database.
5. Save query results to a DataFrame.
6. Close connection.

Example:
```
from sqlalchemy import create_engine, text
import pandas as pd
engine = create_engine('sqlite:///Chinook.sqlite')
con = engine.connect()
# text() wraps the raw SQL string; SQLAlchemy 2.0 no longer accepts plain strings here
rs = con.execute(text("SELECT * FROM table_name"))
df = pd.DataFrame(rs.fetchall())
con.close()
```

Using the context manager:
```
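from sqlalchemy import create_engine, text
import pandas as pd
engine = create_engine('sqlite:///Chinook.sqlite')

# The connection is closed automatically when the with block exits
with engine.connect() as con:
  # text() wraps the raw SQL string; SQLAlchemy 2.0 no longer accepts plain strings here
  rs = con.execute(text("SELECT * FROM table_name"))
  df = pd.DataFrame(rs.fetchmany(size=5)) # size=5 fetches only 5 rows
  df.columns = rs.keys()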
```

```
# Import packages
from sqlalchemy import create_engine, text
import pandas as pd

# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Open engine connection: con
con = engine.connect()

# Perform query: rs (text() wraps the raw SQL string, as SQLAlchemy 2.0 requires)
rs = con.execute(text("SELECT * FROM Album"))

# Save results of the query to DataFrame: df
df = pd.DataFrame(rs.fetchall())

# Close connection
con.close()

# Print head of DataFrame df
print(df.head())
```


```
# Import packages
from sqlalchemy import create_engine, text
import pandas as pd

# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Perform query and save results to DataFrame: df
# (text() wraps the raw SQL string, as SQLAlchemy 2.0 requires)
with engine.connect() as con:
    rs = con.execute(text("SELECT LastName, Title FROM Employee"))
    df = pd.DataFrame(rs.fetchmany(size=3))
    df.columns = rs.keys()

# Print the length of the DataFrame df
print(len(df))

# Print the head of the DataFrame df
print(df.head())
```

## Querying relational databases directly with pandas

```
# Import packages
from sqlalchemy import create_engine
import pandas as pd

# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

df = pd.read_sql_query("SELECT * FROM Orders", engine)
```

## INNER JOIN in Python (Pandas)
```
# Import packages
from sqlalchemy import create_engine, text
import pandas as pd

# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Perform query and save results to DataFrame: df
# (text() wraps the raw SQL string, as SQLAlchemy 2.0 requires)
with engine.connect() as con:
    rs = con.execute(text("SELECT Title, Name FROM Album INNER JOIN Artist ON Album.ArtistId = Artist.ArtistId"))
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()

# Print head of DataFrame df
print(df.head())
```
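To see which rows an INNER JOIN keeps, the same query can be run against a tiny in-memory database. This sketch uses the standard-library `sqlite3` module rather than SQLAlchemy, and the table contents are hypothetical, simplified versions of the `Album` and `Artist` tables:

```python
import sqlite3

# Build two small tables; one album points at an artist that does not exist.
con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Album  (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER);
    INSERT INTO Artist VALUES (1, 'Artist A'), (2, 'Artist B');
    INSERT INTO Album  VALUES (10, 'First Album', 1), (11, 'Orphan Album', 99);
""")

# Only rows whose ArtistId matches in both tables survive the join.
rows = con.execute(
    "SELECT Title, Name FROM Album "
    "INNER JOIN Artist ON Album.ArtistId = Artist.ArtistId"
).fetchall()
con.close()

print(rows)
```

'Orphan Album' has no matching artist and 'Artist B' has no album, so neither appears in the result.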