# Importing flat files in Python

As a data scientist, you will clean, wrangle, visualize and model data every day, and then interpret those models. Before doing any of this, however, you need to know how to get data into Python. In this article, you will learn how to import data from flat files into Python.

## Reading a plain text file

To inspect a plain text file, you can use Python's built-in open function to open a connection to the file.
To do so, follow these steps:
- Assign the filename to a variable of type string.
- Pass the filename to the open function together with the argument mode='r', which makes sure that the file can only be read.
- Assign the text of the file to a variable by calling the read() method on the file connection.
- After you have done this, make sure that you close the connection to the file.

It is always best practice *to clean while cooking*.

In [50]:
filename = 'seaslug.txt'

In [51]:
file = open(filename,mode='r')

In [52]:
text = file.read()

In [53]:
file.close()

In [54]:
# print the file content which you have just read
print(text)

YOU don't know about me without you have read a book by the
name of The Adventures of Tom Sawyer; but that ain't no
matter. That book was made by Mr. Mark Twain, and he told
the truth, mainly. There was things which he stretched, but
mainly he told the truth. That is nothing. never seen
anybody but lied one time or another, without it was Aunt
Polly, or the widow, or maybe Mary. Aunt Polly--Tom's Aunt
Polly, she is--and Mary, and the Widow Douglas is all told
about in that book, which is mostly a true book, with some
stretchers, as I said before.


You can avoid having to close the connection yourself by using a *with* statement. It creates a context in which the file is open and can be read. Once execution leaves this context, the file is closed automatically, which is why the object used in a *with* statement is called a context manager.

In [55]:
with open('seaslug.txt','r') as file:
    print(file.read())

YOU don't know about me without you have read a book by the
name of The Adventures of Tom Sawyer; but that ain't no
matter. That book was made by Mr. Mark Twain, and he told
the truth, mainly. There was things which he stretched, but
mainly he told the truth. That is nothing. never seen
anybody but lied one time or another, without it was Aunt
Polly, or the widow, or maybe Mary. Aunt Polly--Tom's Aunt
Polly, she is--and Mary, and the Widow Douglas is all told
about in that book, which is mostly a true book, with some
stretchers, as I said before.
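
To see that the *with* statement really closes the file, you can check the closed attribute of the file object inside and outside the context. The sketch below is only an illustration: it writes a small throwaway file (the name sample.txt and its contents are made up) and reads just the first two lines with readline() instead of reading the whole file.

```python
import os
import tempfile

# Create a small throwaway text file so the example is self-contained.
# The file name and contents are made up for illustration.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'sample.txt')
with open(path, mode='w') as f:
    f.write('first line\nsecond line\nthird line\n')

# Read only the first two lines instead of the whole file.
with open(path, mode='r') as file:
    first = file.readline()
    second = file.readline()
    still_open = not file.closed   # we are still inside the context

# Outside the with block the file has been closed for us.
closed_after = file.closed

print(first.strip())
print(second.strip())
print(still_open, closed_after)
```

Reading line by line like this can be handy when a file is large and you only want a quick look at its beginning.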


## Reading a flat file (csv)

Flat files are basic text files containing records, that is, table data without structured relationships. This is in contrast to a relational database, in which columns of distinct tables may be related.

The flat file considered here is titanic_sub.csv. Each row of the file is a unique passenger on board, and each column is a feature or attribute.

A flat file can have a header, which is a first row that names the columns of the dataset. It is important to know whether your file has a header or not.

The file extension is .csv, which stands for 'Comma Separated Values'. It means exactly what it says: the values are separated by commas. The values in a flat file can also be separated by tabs. Characters such as tabs and commas are called delimiters.

If the file contains only numeric data, we can use numpy to import it. If the file contains a combination of numeric and string data, we use pandas for importing.
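
As a rough illustration of headers and delimiters, the sketch below parses the same made-up records once with commas and once with tabs, using Python's built-in csv module. The helper read_rows and the sample text are assumptions made for this example, not part of the Titanic file.

```python
import csv
import io

# Two tiny in-memory "files" holding the same made-up records.
comma_text = 'name,age\nAnn,31\nBob,45\n'
tab_text = 'name\tage\nAnn\t31\nBob\t45\n'


def read_rows(text, delimiter):
    """Split each line of text on the given delimiter and return a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


comma_rows = read_rows(comma_text, ',')
tab_rows = read_rows(tab_text, '\t')

# The first row is the header; the remaining rows are the records.
header, records = comma_rows[0], comma_rows[1:]

print(header)
print(records)
print(comma_rows == tab_rows)
```

With the right delimiter, both versions produce the same rows. Note that the csv module returns every value as a string.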

**Importing flat files using numpy**

Let's consider a file that contains only numeric data. The file name is MNIST.txt.

In [56]:
#import numpy first
import numpy as np

In [57]:
data = np.loadtxt('MNIST.txt',delimiter=',')

In [58]:
data

array([[ 1.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 1.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 2.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 5.,  0.,  0., ...,  0.,  0.,  0.]])

The default delimiter is any whitespace, so we need to pass the delimiter explicitly if the file uses anything else.

loadtxt() tends to break down when the file contains mixed data types. For such files we use genfromtxt().
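
The delimiter behaviour of loadtxt() can be tried out on small in-memory data. The sketch below uses made-up values. It also uses two further loadtxt() arguments that this article has not covered: skiprows, to skip a header line, and usecols, to keep only some columns.

```python
import io

import numpy as np

# Small in-memory stand-ins for numeric files (values are made up).
space_text = '1 2 3\n4 5 6\n'
comma_text = 'a,b,c\n1,2,3\n4,5,6\n'

# Whitespace-separated values need no delimiter argument.
space_data = np.loadtxt(io.StringIO(space_text))

# Comma-separated values with a header row: set the delimiter and skip
# the first line, because loadtxt cannot parse the header text as numbers.
comma_data = np.loadtxt(io.StringIO(comma_text), delimiter=',', skiprows=1)

# usecols keeps only the listed columns (here the first and the third).
two_cols = np.loadtxt(io.StringIO(comma_text), delimiter=',', skiprows=1,
                      usecols=[0, 2])

print(space_data.shape, comma_data.shape, two_cols.shape)
```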

In [59]:
data = np.genfromtxt('titanic_sub.csv', delimiter=',', names=True, dtype=None)

Here, the first argument is the filename, the second specifies the delimiter, and the third, names=True, tells genfromtxt() that the first row is a header containing the column names. Passing dtype=None makes it figure out what type each column should be. Because the columns are of different types, data is an object called a structured array. A regular numpy array has to contain elements that are all the same type; the structured array solves this by being a 1D array, where each element of the array is one row of the imported flat file.
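
To get a feel for structured arrays, here is a minimal sketch on a few made-up rows that only resemble the Titanic file. The sample text is an assumption, and encoding='utf-8' is passed so that text columns come back as ordinary strings rather than bytes.

```python
import io

import numpy as np

# A few made-up rows with a header and mixed column types.
sample = ('PassengerId,Survived,Sex,Age\n'
          '1,0,male,22.0\n'
          '2,1,female,38.0\n'
          '3,1,female,26.0\n')

data_small = np.genfromtxt(io.StringIO(sample), delimiter=',', names=True,
                           dtype=None, encoding='utf-8')

# One element per row, so the array is 1D.
print(data_small.shape)

# Column names come from the header row.
print(data_small.dtype.names)

# Index with a column name to get a whole column,
# or with an integer to get a whole row.
print(data_small['Sex'])
print(data_small[0])
```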

In [60]:
np.shape(data)

(891,)

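I have just used np.genfromtxt() to import data containing mixed data types. There is also another function, np.recfromcsv(), that behaves similarly to np.genfromtxt(), except that its defaults are delimiter=',', names=True and dtype=None, so only the filename has to be passed. Note that NumPy 2.0 removed np.recfromcsv() from the main namespace; with recent versions, use np.genfromtxt() with a comma delimiter instead.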

In [61]:
d = np.recfromcsv('titanic_sub.csv')

In [62]:
#print first 3 entries of d
print(d[:3])

[(1, 0, 3, b'male',  22., 1, 0, b'A/5 21171',   7.25  , b'', b'S')
 (2, 1, 1, b'female',  38., 1, 0, b'PC 17599',  71.2833, b'C85', b'C')
 (3, 1, 3, b'female',  26., 0, 0, b'STON/O2. 3101282',   7.925 , b'', b'S')]


**Importing flat files using pandas**

Although arrays are incredibly powerful and serve a number of essential purposes, they cannot fulfill some of the basic needs of a data scientist:
- Two-dimensional labeled data structures
- Columns of potentially different types
- Manipulating, slicing, reshaping, grouping, joining and merging data
- Performing statistics
- Working with time series

For this we need the pandas library, which helps you carry out an entire data analysis workflow in Python.

Pandas offers a data structure called the *DataFrame*, which helps in analysis.

Manipulating dataframes in pandas can be useful in all steps, including the following:
- Exploratory data analysis
- Data wrangling
- Data Preprocessing
- Building models
- Visualization

For all of these reasons, it is standard and best practice to use pandas to import flat files as dataframes.

In [63]:
#first import pandas
import pandas as pd

In [64]:
#call read_csv() function to import into dataframe
df = pd.read_csv('titanic_sub.csv')

In [65]:
#check the first 5 entries by using head()
df.head()

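| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |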
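
read_csv() also copes with the header and delimiter variations discussed earlier. The sketch below is an illustration on made-up, tab-separated data without a header row. The arguments sep, header, names and nrows are real read_csv() parameters, but the column names and values are assumptions for this example.

```python
import io

import pandas as pd

# Made-up tab-separated data without a header row.
tab_text = 'Ann\t31\nBob\t45\nCid\t27\n'

df_small = pd.read_csv(
    io.StringIO(tab_text),
    sep='\t',               # the delimiter is a tab, not a comma
    header=None,            # the file has no header row
    names=['name', 'age'],  # so we supply the column names ourselves
    nrows=2,                # read only the first two rows
)

print(df_small)
```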


In [66]:
#we can convert a dataframe to numpy array as follows
df_array = df.values

In [67]:
type(df_array)

numpy.ndarray
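
One thing to keep in mind when converting is that a numpy array holds a single type. The sketch below, on a small made-up dataframe, suggests what that means in practice. It uses the to_numpy() method, which does the same job as the values attribute used above.

```python
import pandas as pd

# A small made-up dataframe with one text column and two numeric columns.
df_mixed = pd.DataFrame({
    'name': ['Ann', 'Bob'],
    'age': [31, 45],
    'fare': [7.25, 71.28],
})

# Mixed column types: the array falls back to the generic object type.
mixed = df_mixed.to_numpy()

# Numeric columns only: the array gets a numeric type.
numeric = df_mixed[['age', 'fare']].to_numpy()

print(mixed.dtype)
print(numeric.dtype)
```

Selecting only the numeric columns before converting is one way to end up with an array that is ready for numerical work.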