# Introduction to Importing Data in Python

## Introduction and Flat files

### Importing Flat files using NumPy

Why NumPy? 
- NumPy arrays: standard for storing numerical data
- Essential for other packages: e.g. scikit-learn

In [None]:
import numpy as np
filename = 'MNIST.txt'
data = np.loadtxt(filename, delimiter=',', skiprows=1, usecols=[0,2], dtype = str)

- skiprows: the number of leading rows to skip (a count, not a list of row indices)
- usecols: list of the indices of the columns you wish to keep
- dtype: the data type that every entry is converted to on import  
note: for tab-delimited files, use delimiter='\t'
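
As a rough illustration of these arguments on tab-delimited input, the sketch below reads from an in-memory string rather than a file. The column names and values are made up for the example, and `io.StringIO` only stands in for a real file on disk.

```python
import io
import numpy as np

# Hypothetical tab-delimited data with one header row (stand-in for a real file).
text = "id\tlabel\tvalue\n1\t7\t0.5\n2\t3\t1.5\n3\t9\t2.5\n"

# Skip the single header row and keep only the first and third columns.
data = np.loadtxt(io.StringIO(text), delimiter='\t', skiprows=1, usecols=[0, 2])

print(data.shape)
print(data.dtype)
```

With no dtype given, loadtxt() converts every entry to float.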

loadtxt() works well for basic cases, but tends to break down when the file has mixed datatypes,  
e.g. some columns of floats and some columns of strings.
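
A small sketch of that failure mode, using made-up in-memory data: with the default float dtype the text column cannot be converted, and falling back to dtype=str reads everything but leaves the numbers as strings.

```python
import io
import numpy as np

# Hypothetical file contents: one text column and one numeric column.
text = "name,score\nalice,9.5\nbob,7.0\n"

# The default dtype is float, so the 'name' column cannot be converted.
try:
    np.loadtxt(io.StringIO(text), delimiter=',', skiprows=1)
    failed = False
except ValueError as err:
    failed = True
    print("loadtxt failed:", err)

# Reading everything as str succeeds, but the scores are now strings too.
as_text = np.loadtxt(io.StringIO(text), delimiter=',', skiprows=1, dtype=str)
print(as_text)
```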

### Importing Flat files using Pandas

In [None]:
import pandas as pd
filename = 'winequality-red.csv'
data = pd.read_csv(filename)
data.head()
data_array = data.to_numpy()

data.head(): Returns the first 5 rows of the DataFrame, shown along with the column labels  
data.to_numpy(): Converts the DataFrame into a NumPy array  
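
To see both calls without needing the CSV file, here is a minimal sketch on a small in-memory table. The column names and values are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text with a header row and 8 data rows.
csv_text = "acidity,sugar,quality\n" + "".join(
    f"{i}.0,{i * 2}.5,{i % 3}\n" for i in range(8)
)

data = pd.read_csv(io.StringIO(csv_text))

first_rows = data.head()      # DataFrame holding the first 5 rows
data_array = data.to_numpy()  # plain NumPy array; column labels are dropped

print(first_rows)
print(data_array.shape)
```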

There are a number of arguments that pd.read_csv() takes that you'll find useful for this exercise:

- nrows: allows you to specify how many rows to read from the file. For example, nrows=10 will only import the first 10 rows.
- header: accepts the row number(s) to use as the column labels; the data starts on the row after. If the file does not contain a header row, you can set header=None, and pandas will automatically assign integer column labels starting from 0 (0, 1, 2, ...).

In [None]:
# Assign the filename: file
file = 'digits.csv'

# Read the first 5 rows of the file into a DataFrame: data
data = pd.read_csv(file, nrows = 5, header = None)

# Build a numpy array from the DataFrame: data_array
data_array = data.to_numpy()

# Print the datatype of data_array to the shell
print(type(data_array))
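
The effect of header=None can be checked on a small in-memory example. The sketch below uses made-up digits and compares the result with the default behaviour, where the first row is consumed as column labels.

```python
import io
import pandas as pd

# Hypothetical headerless data: every row is real data.
csv_text = "1,0,8\n0,5,9\n4,4,2\n7,1,3\n"

# header=None keeps the first row as data and labels the columns 0, 1, 2.
no_header = pd.read_csv(io.StringIO(csv_text), nrows=2, header=None)
print(no_header)

# With the default header setting, the first row becomes the column labels,
# so nrows=2 returns the second and third lines of the text instead.
default_header = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(default_header)
```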

Other key arguments for pd.read_csv() include:
- sep: sets the expected delimiter.  
You can use ',' for comma-delimited.  
You can use '\t' for tab-delimited.  

- comment: takes a single character that marks the start of a comment; the rest of the line after that character is ignored.  

- na_values: takes a list of strings to identify as NA/NaN. By default, some values are already recognized as NA/NaN. Providing this argument will supply additional values.

In [None]:
# Assign filename: file
file = 'titanic_corrupt.txt'

# Import file: data
data = pd.read_csv(file, sep='\t', comment='#', na_values=['Nothing'])

# Print the head of the DataFrame
print(data.head())
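
For a version that runs without titanic_corrupt.txt, the sketch below applies the same three arguments to a small in-memory string. The rows, column names, and comment text are hypothetical.

```python
import io
import pandas as pd

# Hypothetical tab-delimited text with a comment line, a trailing comment,
# and the placeholder 'Nothing' standing in for a missing value.
text = (
    "# exported from a made-up source\n"
    "name\tage\tcabin\n"
    "Ann\t29\tNothing\n"
    "Bob\t41\tC85#checked by hand\n"
)

data = pd.read_csv(io.StringIO(text), sep='\t', comment='#', na_values=['Nothing'])

print(data)
print(data.isna().sum())
```

The fully commented first line is skipped, the text after '#' on the last row is dropped, and 'Nothing' is read as NaN.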