# Importing flat files with pandas

**Why use pandas?**

Pandas supports two-dimensional labelled data structures whose columns can hold different types, called **dataframes**.

It comes with many methods for manipulating those dataframes, for example to:

- slice, merge, reshape, join and group data
- compute statistics such as the mean and standard deviation
- work with time series data
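As a rough illustration of the operations listed above, the sketch below slices a small dataframe and computes grouped statistics. The data and the column names (`city`, `temp_c`) are made up for this example.

```python
import pandas as pd

# Hypothetical data, invented purely for illustration.
df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Bergen", "Bergen"],
    "temp_c": [2.0, 4.0, 6.0, 8.0],
})

# Slice: keep only the rows where temp_c is above 3.
warm = df[df["temp_c"] > 3]

# Group by city, then compute the mean and standard deviation of temp_c.
stats = df.groupby("city")["temp_c"].agg(["mean", "std"])
print(stats)
```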

Pandas allows you to carry out a complete data analysis workflow in Python without having to switch to a more domain-specific language like `R`.

Pandas dataframes are used in:

- exploratory data analysis (EDA)
- data wrangling
- data preprocessing
- building models
- data visualization

It's standard (and best) practice to import flat files directly into pandas dataframes.

To import a `csv` file, use the pandas `read_csv()` function.

- Use `nrows=` to import a set number of rows of data.
- Use `header=None` when the file does not have a header row.
- Use `sep=` to specify the delimiter.
- Use `comment=` to specify the character that starts a comment, e.g. `comment='#'`, so that the rest of the line is ignored.
- Use `na_values=` with a list of strings to recognise as missing values; these are read in as `NaN`.
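A minimal sketch of the structural options (`sep`, `header`, `nrows`) follows. It reads from an in-memory string instead of a file so it runs without any data on disk. The semicolon-delimited text is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited text with no header row.
raw = "1;2;3\n4;5;6\n7;8;9\n"

# sep=';' sets the delimiter, header=None tells pandas that the first
# line is data, and nrows=2 stops after two rows.
df_small = pd.read_csv(io.StringIO(raw), sep=";", header=None, nrows=2)
print(df_small)
```

Because `header=None` is set, pandas labels the columns with integers starting at 0 instead of taking names from the first line.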

A dataframe can be converted to a NumPy array through its `values` attribute.

In [6]:
import pandas as pd

df = pd.read_csv('data/mnist.csv', nrows=5, header=None)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,784
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
# convert a dataframe into a numpy array
np_array = df.values
np_array

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [4., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
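As an aside, recent pandas documentation recommends the `to_numpy()` method over the `values` attribute for this conversion. The sketch below uses a small made-up dataframe (`frame`, with hypothetical columns `a` and `b`) rather than the MNIST data above.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column dataframe, invented for illustration.
frame = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# to_numpy() returns the data as a NumPy ndarray, like .values does here.
arr = frame.to_numpy()
print(type(arr), arr.shape)
```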

The pandas package also handles many of the issues you will encounter when importing data, such as comments in flat files, empty lines and missing values. Note that missing values are commonly referred to as `NA` or `NaN`.

In [15]:
data = pd.read_csv('data/titanic_corrupt.txt', sep='\t', comment='#', na_values=['Nothing', ' '], nrows=5)
data

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,female,38.0,1,0,PC 17599,,,
2,3,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,male,35.0,0,0,373450,8.05,,S
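To check how these options behave without the Titanic file, the sketch below parses a small in-memory string that contains a comment line, a blank line and a placeholder for a missing value. The text and the column names (`name`, `age`) are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical tab-separated text with a comment line, a blank line,
# and the string 'Nothing' standing in for a missing value.
raw = (
    "# exported from a made-up system\n"
    "name\tage\n"
    "Ann\t31\n"
    "\n"
    "Bob\tNothing\n"
)

# comment='#' drops the comment line, blank lines are skipped by default,
# and na_values=['Nothing'] turns that placeholder into NaN.
people = pd.read_csv(io.StringIO(raw), sep="\t", comment="#", na_values=["Nothing"])

# Count the missing values in each column.
print(people.isna().sum())
```

Calling `isna().sum()` after an import is a quick way to confirm that the placeholders were actually recognised as missing.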
