# Importing flat files

Flat files are text files containing a single table with no structured relationships, as opposed to a relational database, which consists of multiple tables that can be related to one another.

The table consists of records. Each record is a row of fields, also called attributes or features, and each field holds at most one item of information about a unique sample.

In the `titanic.csv` file, each row is a unique passenger, and each column is a feature or attribute, e.g. name, age, cabin, etc.

A flat file may have a header, as many `csv` files do. It is the first row and describes the contents of the data columns.

`txt` and `csv` files are examples of flat files.

Values in flat files can be separated by a number of different delimiters, e.g. commas, spaces or tabs.
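The header and delimiter ideas can be sketched with Python's built-in `csv` module. The sample text and the `read_records` helper below are made up for illustration; they are not part of the data files used later in these notes.

```python
import csv
import io

# Hypothetical sample data: the same two-record table, once comma-delimited
# and once tab-delimited. The first row of each is the header.
comma_text = "name,age,cabin\nAlice,29,C85\nBob,41,E46\n"
tab_text = "name\tage\tcabin\nAlice\t29\tC85\nBob\t41\tE46\n"


def read_records(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


comma_records = read_records(comma_text, ",")
tab_records = read_records(tab_text, "\t")

print(comma_records[0]["name"])      # Alice
print(comma_records == tab_records)  # True: only the delimiter differs
```

Note that every field comes back as a string here; converting columns to numbers is what the NumPy functions below take care of.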

Flat files are generally imported with `NumPy` if you want to manipulate the data as an array, or with `pandas` if you want to manipulate it as a DataFrame.

## Importing Flat Files with NumPy

Why use NumPy?

NumPy arrays are the Python standard for storing numerical data since they're efficient and fast.

NumPy arrays are sometimes required by other packages, e.g. `scikit-learn`.

NumPy has a number of functions, such as `loadtxt()` and `genfromtxt()`, which make importing flat files easy.

### loadtxt()

- The default delimiter is any whitespace.
- By default it imports **numerical** data only (`dtype=float`); a field that cannot be converted raises a `ValueError`.
- Use the argument `dtype=str` to import strings.
- If your file has a header with string column names, you can skip it with the `skiprows=1` argument. `skiprows` is the number of lines to skip from the top of the file, so `1` skips only the first line.
- Import specific columns with `usecols`, passing a list of zero-based column indices, e.g. `usecols=[0, 1, 3]`.
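The options above can be tried without any files on disk. This is a minimal sketch that assumes a small made-up table held in a string and passed to `loadtxt()` through `io.StringIO`; the column names and values are hypothetical.

```python
import io

import numpy as np

# Hypothetical comma-delimited data with a one-line header.
text = "id,height,weight\n1,1.70,65.0\n2,1.82,80.5\n3,1.65,58.2\n"

# Skip the header and keep only the 2nd and 3rd columns (indices 1 and 2).
values = np.loadtxt(io.StringIO(text), delimiter=",", skiprows=1, usecols=[1, 2])
print(values.shape)   # (3, 2)

# Without skiprows, the header strings cannot be converted to float.
try:
    np.loadtxt(io.StringIO(text), delimiter=",")
    header_failed = False
except ValueError:
    header_failed = True
print(header_failed)  # True

# With dtype=str every field, header included, is kept as text.
as_text = np.loadtxt(io.StringIO(text), delimiter=",", dtype=str)
print(as_text.shape)  # (4, 3)
```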

In [10]:
import numpy as np

data = np.loadtxt('data/mnist.txt', delimiter=',')
data

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [2., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [5., 0., 0., ..., 0., 0., 0.]])

In [14]:
# skip the header row, tab delimiter
data = np.loadtxt('data/seaslug.txt', delimiter='\t', skiprows=1)
data[0:5] # first 5 rows

array([[9.90e+01, 6.70e-02],
       [9.90e+01, 1.33e-01],
       [9.90e+01, 6.70e-02],
       [9.90e+01, 0.00e+00],
       [9.90e+01, 0.00e+00]])

In [19]:
# retrieve the 1st, 3rd, 5th and 9th columns - use the zero-based column indices
data = np.loadtxt('data/mnist.txt', delimiter=',', usecols=[0, 2, 4, 8])
data[0:5]

array([[1., 0., 0., 0.],
       [0., 0., 0., 0.],
       [1., 0., 0., 0.],
       [4., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [21]:
# import strings
data = np.loadtxt('data/dummy.txt', delimiter=',', dtype=str)
data

array([["'werwer'", "'werwer'", "'werwer'", "'werwer'", "'werwer'",
        "werwerwer'"],
       ["'werwerw'", "'werwer'", "'wrwerwer'", "'werwer'", "'werwer'",
        "'ewrwer'"],
       ["'werwer'", "'werwer'", "'werwer'", "'werwerwer'", "'werwerwer'",
        "'wrewer'"]], dtype='<U11')

In [27]:
# mixed numbers and strings are all imported as strings when 'dtype=str'
data = np.loadtxt('data/mixed.txt', delimiter=',', dtype=str)
data

array([["'werwer'", '3', "'werwer'", '5', "'werwer'", "'werwer'",
        "'werwer'", "werwerwer'"],
       ["'werwerw'", '3', "'werwer'", '6', "'wrwerwer'", "'werwer'",
        "'werwer'", "'ewrwer'"],
       ["'werwer'", '3', "'werwer'", '8', "'werwer'", "'werwerwer'",
        "'werwerwer'", "'wrewer'"]], dtype='<U11')

In [30]:
# mixed numbers and string csv file
data = np.loadtxt('data/titanic.csv', skiprows=1, delimiter=',', dtype=str)
data

array([['1', '0', '3', ..., '7.25', '', 'S'],
       ['2', '1', '1', ..., '71.2833', 'C85', 'C'],
       ['3', '1', '3', ..., '7.925', '', 'S'],
       ...,
       ['889', '0', '3', ..., '23.45', '', 'S'],
       ['890', '1', '1', ..., '30.0', 'C148', 'C'],
       ['891', '0', '3', ..., '7.75', '', 'Q']], dtype='<U18')

In [31]:
# load csv of numerical data
data = np.loadtxt('data/mnist.csv', delimiter=',')
data

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [2., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [5., 0., 0., ..., 0., 0., 0.]])

In [39]:
# import the same file as either string or numerical data
file = 'data/seaslug.txt'

# Import file: data
data = np.loadtxt(file, delimiter='\t', dtype=str)

# Import data as floats, tab-delimited and skip the first row: data_float
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)
np.shape(data_float)

(47, 2)

### genfromtxt()

When dealing with files with mixed datatypes, a more robust option is NumPy's `genfromtxt()` function with `dtype=None`, which makes it infer the type of each column.

The `names=True` argument tells NumPy that the file contains a header, which it uses for the column names.

Whereas `loadtxt()` returns a regular **NumPy array**, whose elements must all share one data type, `genfromtxt()` called this way generates a **structured array**: a 1-D array where each element is a row imported from the flat file and each column can have its own type. You can test this by checking the array's shape with `np.shape(data)`.

Accessing rows and columns of a structured array is similar to accessing an element in a list or dictionary: to get the `i`th row, execute `data[i]`, and to get the column named `'Fare'`, execute `data['Fare']`.
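A small self-contained sketch of the same behaviour, assuming a made-up three-row table passed in through `io.StringIO` rather than a real file:

```python
import io

import numpy as np

# Hypothetical mixed-type data: an integer, a string and a float column.
text = "Id,Sex,Fare\n1,male,7.25\n2,female,71.2833\n3,female,7.925\n"

data = np.genfromtxt(io.StringIO(text), delimiter=",", names=True,
                     dtype=None, encoding=None)

print(data.shape)        # (3,): one element per row
print(data.dtype.names)  # ('Id', 'Sex', 'Fare')
print(data[0])           # the first row
print(data["Fare"])      # the 'Fare' column as a float array
```

Because the columns have different types, the result is one-dimensional: the column structure lives in the `dtype`, not in a second axis.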

In [41]:
# importing titanic data
data = np.genfromtxt('data/titanic.csv', delimiter=',', names=True, dtype=None, encoding=None)
data[0:5]

array([(1, 0, 3, 'male', 22., 1, 0, 'A/5 21171',  7.25  , '', 'S'),
       (2, 1, 1, 'female', 38., 1, 0, 'PC 17599', 71.2833, 'C85', 'C'),
       (3, 1, 3, 'female', 26., 0, 0, 'STON/O2. 3101282',  7.925 , '', 'S'),
       (4, 1, 1, 'female', 35., 1, 0, '113803', 53.1   , 'C123', 'S'),
       (5, 0, 3, 'male', 35., 0, 0, '373450',  8.05  , '', 'S')],
      dtype=[('PassengerId', '<i8'), ('Survived', '<i8'), ('Pclass', '<i8'), ('Sex', '<U6'), ('Age', '<f8'), ('SibSp', '<i8'), ('Parch', '<i8'), ('Ticket', '<U18'), ('Fare', '<f8'), ('Cabin', '<U15'), ('Embarked', '<U1')])

In [42]:
np.shape(data)

(891,)

In [44]:
data['Pclass'][0:5]

array([3, 1, 3, 1, 3])

### recfromcsv()

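`recfromcsv()` behaves like `genfromtxt()`, but you'll only need to pass the file to it because it has the defaults `delimiter=','`, `names=True` and `dtype=None`. It also converts the field names to lowercase, as the output below shows.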

In [46]:
data = np.recfromcsv('data/titanic.csv', encoding=None)
data[:5]

rec.array([(1, 0, 3, 'male', 22., 1, 0, 'A/5 21171',  7.25  , '', 'S'),
           (2, 1, 1, 'female', 38., 1, 0, 'PC 17599', 71.2833, 'C85', 'C'),
           (3, 1, 3, 'female', 26., 0, 0, 'STON/O2. 3101282',  7.925 , '', 'S'),
           (4, 1, 1, 'female', 35., 1, 0, '113803', 53.1   , 'C123', 'S'),
           (5, 0, 3, 'male', 35., 0, 0, '373450',  8.05  , '', 'S')],
          dtype=[('passengerid', '<i8'), ('survived', '<i8'), ('pclass', '<i8'), ('sex', '<U6'), ('age', '<f8'), ('sibsp', '<i8'), ('parch', '<i8'), ('ticket', '<U18'), ('fare', '<f8'), ('cabin', '<U15'), ('embarked', '<U1')])
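
If `np.recfromcsv` is not available in your NumPy version (it was dropped from the main namespace in NumPy 2.0), a similar result can be obtained from `genfromtxt()`. The following is a sketch under that assumption, using made-up sample data; `case_sensitive="lower"` reproduces the lowercase field names and `.view(np.recarray)` gives attribute-style access.

```python
import io

import numpy as np

# Hypothetical sample data standing in for a csv file on disk.
text = "PassengerId,Sex,Fare\n1,male,7.25\n2,female,71.2833\n"

rec = np.genfromtxt(io.StringIO(text), delimiter=",", names=True, dtype=None,
                    encoding=None, case_sensitive="lower").view(np.recarray)

print(rec.dtype.names)  # ('passengerid', 'sex', 'fare')
print(rec.fare)         # fields are also available as attributes
```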