## Author: Dere, Abdulhameed Abiola

## DATA IMPORTATION

In industry, data rarely arrives clean, structured, and in the conventional CSV format. As a Data Scientist, you should therefore be able to import data regardless of its source or format.

Common formats in which data is available include:

1. CSV
2. TXT
3. XLS/XLSX (Excel)
4. sas7bdat (SAS)
5. Stata
6. Rdata (R)

Loading data into the Python environment is the first step of any data analysis.

When importing external files from any source, check the following points:

1. Whether a header row exists
2. Whether special values need to be treated as missing values
3. Whether each variable (column) has a consistent data type
4. Whether date variables use a consistent date format
5. Whether any rows were truncated while reading the external data
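
Most of the checks listed above correspond to parameters of `read_csv()`. The following is a minimal sketch, assuming a small made-up CSV held in memory; the column names and the `?` missing-value marker are illustrative and do not come from the datasets used later.

```python
import io
import pandas as pd

# Hypothetical raw data: '?' marks a missing value, dates are ISO strings.
raw = io.StringIO(
    "id,amount,signup_date\n"
    "001,250,2020-01-15\n"
    "002,?,2020-02-20\n"
    "003,400,2020-03-05\n"
)

checked = pd.read_csv(
    raw,
    header=0,                     # point 1: the first row holds the column names
    na_values=["?"],              # point 2: treat '?' as a missing value
    dtype={"id": str},            # point 3: keep IDs as text so leading zeros survive
    parse_dates=["signup_date"],  # point 4: parse the date column as datetimes
)

print(checked.dtypes)
print(len(checked))               # point 5: compare with the row count you expect
```

Printing `dtypes` and the row count right after loading is a quick way to spot a column that was read with the wrong type or a file that was cut short.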

# Step One

Install and load the pandas package. If you are using Anaconda, pandas should already be installed, and you have probably used it in some Python modules. Load the package with the command below:

import pandas as pd 

# Step Two

## 1. Import the CSV file

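Note that a single backslash does NOT always work when specifying the file path, because Python treats a backslash in an ordinary string as the start of an escape sequence. You need to change it to a forward slash, add one more backslash, or use a raw string, as shown below: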

In [22]:
import pandas as pd

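# Single backslashes: left commented out because \U in 'C:\Users' is read as a Unicode escape and raises a SyntaxError
# data = pd.read_csv('C:\Users\HP\Desktop\codellc\DSN\2006.csv')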

In [23]:
# Using double backslashes
data = pd.read_csv('C:\\Users\\HP\\Desktop\\codellc\\DSN\\2006.csv')

In [24]:
# Using forward slashes
data = pd.read_csv('C:/Users/HP/Desktop/codellc/DSN/2006.csv')

In [25]:
# Using a raw string (the r prefix)
data = pd.read_csv(r'C:\Users\HP\Desktop\codellc\DSN\2006.csv')
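
A further option, not used in the cells above, is the standard-library `pathlib` module, which builds paths without typing separators by hand. This is a minimal sketch; the relative folder layout is an assumption that mirrors the example path, and `read_csv()` accepts the resulting path object directly.

```python
from pathlib import Path, PureWindowsPath

# Join the path piece by piece; pathlib inserts the right separator for the OS.
csv_path = Path("codellc") / "DSN" / "2006.csv"
print(csv_path)

# PureWindowsPath parses a Windows path on any OS; as_posix() gives the
# forward-slash form of the same path.
win_path = PureWindowsPath(r"C:\Users\HP\Desktop\codellc\DSN\2006.csv")
print(win_path.as_posix())

# pd.read_csv(csv_path) would then read the file, provided it exists.
```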

In [31]:
df = pd.read_csv('Loan prediction train.csv')

df.head(2)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N


## If the raw data file has no header (title) row, you need to tell pandas so with header=None

In [34]:
df2 = pd.read_csv('Loan prediction train.csv', header = None)

df2.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
1,LP001002,Male,No,0,Graduate,No,5849,0,,360,1,Urban,Y


## You can also supply column names that suit you

In [39]:
df3 = pd.read_csv('Loan prediction train.csv', header = None, names=['ID','Sex','Relationship','Dependants','School_level','Employment','Income','Co_income','LoanAmount','Term','Credit','Property','Status'])

df3.head(2)

Unnamed: 0,ID,Sex,Relationship,Dependants,School_level,Employment,Income,Co_income,LoanAmount,Term,Credit,Property,Status
0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
1,LP001002,Male,No,0,Graduate,No,5849,0,,360,1,Urban,Y
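
In the output above, the file's original header line appears as the first data row, because this file does have a header and `header=None` tells pandas to treat every line as data. If the goal is to replace existing column names, one option is to pass `header=0` together with `names`. The sketch below uses a two-column in-memory sample made up for illustration rather than the full loan file.

```python
import io
import pandas as pd

# Small stand-in for a CSV file that already has a header line.
sample = "Loan_ID,Gender\nLP001002,Male\nLP001003,Male\n"

# header=None: the original header line is kept as row 0 of the data.
as_data = pd.read_csv(io.StringIO(sample), header=None, names=["ID", "Sex"])

# header=0: the original header line is discarded and replaced by `names`.
replaced = pd.read_csv(io.StringIO(sample), header=0, names=["ID", "Sex"])

print(as_data.head())
print(replaced.head())
```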


## The column names can also be set separately using the columns attribute of the DataFrame

In [42]:
df2.columns = ['ID','Sex','Relationship','Dependants','School_level','Employment','Income','Co_income','LoanAmount','Term','Credit','Property','Status']

In [43]:
df2

Unnamed: 0,ID,Sex,Relationship,Dependants,School_level,Employment,Income,Co_income,LoanAmount,Term,Credit,Property,Status
0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
1,LP001002,Male,No,0,Graduate,No,5849,0,,360,1,Urban,Y
2,LP001003,Male,Yes,1,Graduate,No,4583,1508,128,360,1,Rural,N
3,LP001005,Male,Yes,0,Graduate,Yes,3000,0,66,360,1,Urban,Y
4,LP001006,Male,Yes,0,Not Graduate,No,2583,2358,120,360,1,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
610,LP002978,Female,No,0,Graduate,No,2900,0,71,360,1,Rural,Y
611,LP002979,Male,Yes,3+,Graduate,No,4106,0,40,180,1,Rural,Y
612,LP002983,Male,Yes,1,Graduate,No,8072,240,253,360,1,Urban,Y
613,LP002984,Male,Yes,2,Graduate,No,7583,0,187,360,1,Urban,Y


## 2. Import File from URL

Importing from a URL is no different from importing a local CSV file. Simply pass the URL to the read_csv() function.

NOTE: With read_csv(), this works ONLY when the URL points to a CSV file.

In [53]:
url_df = pd.read_csv('http://winterolympicsmedals.com/medals.csv')

url_df.head(2)

Unnamed: 0,Year,City,Sport,Discipline,NOC,Event,Event gender,Medal
0,1924,Chamonix,Skating,Figure skating,AUT,individual,M,Silver
1,1924,Chamonix,Skating,Figure skating,AUT,individual,W,Gold


## 3. Import Text File

We can use the read_table() function to pull data from a text file. We can also use read_csv() with sep="\t" to read data from a tab-separated file.

In [46]:
text_df = pd.read_csv('pop.txt', sep= "\t")

text_df.head(2)

Unnamed: 0,Rank,State,Population
0,1,Kano State,9401288
1,2,Lagos State,9113605
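
The two approaches can be compared side by side. This minimal sketch rebuilds the two rows shown above as an in-memory tab-separated string (an assumption about how `pop.txt` is laid out) and reads it both ways.

```python
import io
import pandas as pd

# Tab-separated text standing in for the contents of a file like pop.txt.
tab_text = (
    "Rank\tState\tPopulation\n"
    "1\tKano State\t9401288\n"
    "2\tLagos State\t9113605\n"
)

# read_table() uses a tab as its default separator.
via_table = pd.read_table(io.StringIO(tab_text))

# read_csv() needs the separator spelled out.
via_csv = pd.read_csv(io.StringIO(tab_text), sep="\t")

print(via_table.equals(via_csv))
```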


## 4. Read Excel File

The read_excel() function can be used to import Excel data into Python.

In [49]:
excel_df = pd.read_excel('Copy of 2006.xlsx')

excel_df.head(2)

Unnamed: 0,STATES,AREA (km2),Population
0,Abia State,6320,2845380
1,Adamawa State,36917,3178950


## Other sources that can be read include:

1. SAS data files, using the read_sas() function
2. Stata data files, using the read_stata() function
3. SQL tables, using the read_sql() function
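
Reading a SQL table can be sketched with Python's built-in `sqlite3` module. The in-memory database, the `states` table, and its two rows are assumptions made up for this example; with a real database you would connect to it instead.

```python
import sqlite3
import pandas as pd

# Build a throwaway in-memory database with one small table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE states (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO states VALUES (?, ?)",
    [("Kano State", 9401288), ("Lagos State", 9113605)],
)
conn.commit()

# Run a query and load the result straight into a DataFrame.
sql_df = pd.read_sql_query(
    "SELECT name, population FROM states ORDER BY population DESC", conn
)
conn.close()

print(sql_df)
```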

### For more information on importing and reading different types of files, check out the link below:

https://www.listendata.com/2017/02/import-data-in-python.html#Import-CSV-files