# Exercise 1 : Getting to Know your Data
Author : Burhan Abbasi 

Date : 26 January 2018

Python version 3.7

Disclaimer : This series of tutorials uses data hosted on Kaggle. For the purpose of this course, a copy has been downloaded from the site. For the latest data, visit https://www.kaggle.com/timoboz/stock-data-dow-jones

## Step 1: File Count & File Formats

The data files are located at 'Data_Set/stock-data-dow-jones/', relative to the code files.
Before importing the files, we want to know how many files are in the storage location and which formats they use. For example, are some files in csv and others in xlsx?

In [7]:
# listdir returns a list with the names of the entries in the directory given by path
from os import listdir
dir_path = 'Data_Set/stock-data-dow-jones/'
files_list = sorted(listdir(dir_path))  # sorted, because listdir returns entries in arbitrary order

In [2]:
print('Total files in storage location: ', len(files_list) ) #file count
print('First 5 File Names are:',files_list[:5]) #prints first 5 file names
print('Last 5 File Names are: ',files_list[-5:]) #prints last 5 file names

Total files in storage location:  30
First 5 File Names are: ['AAPL.csv', 'AXP.csv', 'BA.csv', 'CAT.csv', 'CSCO.csv']
Last 5 File Names are:  ['V.csv', 'VZ.csv', 'WBA.csv', 'WMT.csv', 'XOM.csv']


Identifying the formats.
Since the file names printed above all end in csv, we extract the last 3 characters of each name and collect the distinct values.

In [4]:
_formats = [f[-3:] for f in files_list]
_formatsSet = set(_formats)
print('Number of data formats used in dataset: ',len(_formatsSet))
print('Formats used in data: ', _formatsSet) 

Number of data formats used in dataset:  1
Formats used in data:  {'csv'}
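
Slicing the last 3 characters works here because every name ends in csv, but it would report a four-letter extension such as xlsx as 'lsx'. The following is a minimal sketch of a more general approach using `os.path.splitext` and `collections.Counter`. The names in `example_names` are made up for illustration and are not part of the dataset.

```python
from collections import Counter
from os.path import splitext

# Hypothetical file names, for illustration only (not from the dataset)
example_names = ['AAPL.csv', 'AXP.csv', 'notes.xlsx', 'README']

# splitext splits 'AAPL.csv' into ('AAPL', '.csv'); a name without a dot yields ''
extensions = [splitext(name)[1].lower() for name in example_names]

# Counter reports how many files use each extension
format_counts = Counter(extensions)
print(format_counts)
```

Passing `files_list` instead of `example_names` would apply the same count to the real directory listing.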


##### Findings : We now know the number of files in the directory and the file formats used.

## Step 2: Reading a File
We will use Pandas to read the files and preprocess the data.

In [5]:
import pandas as pd   #https://pandas.pydata.org/
print(pd.__version__) #check version of Pandas

0.24.0


In [12]:
dataFrame1= pd.read_csv(dir_path+files_list[0])  #read a csv file
print(dataFrame1.columns)       #view attributes of the data
print(dataFrame1.shape)         #information about the number of records and attributes in data 
dataFrame1.head()               #view top 5 rows of dataframe to get an idea of what the data looks like

Index(['date', 'open', 'high', 'low', 'close', 'volume', 'unadjustedVolume',
       'change', 'changePercent', 'vwap', 'label', 'changeOverTime'],
      dtype='object')


| | date | open | high | low | close | volume | unadjustedVolume | change | changePercent | vwap | label | changeOverTime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-01-27 | 72.1199 | 72.7401 | 71.5535 | 72.1763 | 144219152 | 20602736 | 0.580818 | 0.811 | 70.6687 | Jan 27, 14 | 0.0 |
| 1 | 2014-01-28 | 66.7037 | 67.5219 | 65.8266 | 66.4074 | 266833581 | 38119083 | -5.7689 | -7.993 | 66.7869 | Jan 28, 14 | -0.079928 |
| 2 | 2014-01-29 | 66.0731 | 66.5215 | 65.3743 | 65.6535 | 125942796 | 17991828 | -0.753887 | -1.135 | 65.8259 | Jan 29, 14 | -0.090373 |
| 3 | 2014-01-30 | 65.8882 | 66.4074 | 65.1226 | 65.5266 | 169762789 | 24251827 | -0.126916 | -0.193 | 71.9614 | Jan 30, 14 | -0.092131 |
| 4 | 2014-01-31 | 64.9233 | 65.7558 | 64.7096 | 65.6339 | 116336444 | 16619492 | 0.107251 | 0.164 | 71.6528 | Jan 31, 14 | -0.090645 |
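
By default `read_csv` loads the `date` column as plain text. The sketch below shows how the `parse_dates` argument converts it to datetime values on load. It reads from an in-memory string rather than from disk, and the two rows are copied from the output above and reduced to three columns for brevity.

```python
import io
import pandas as pd

# Two rows copied from the head() output above, reduced to three columns
csv_text = """date,open,close
2014-01-27,72.1199,72.1763
2014-01-28,66.7037,66.4074
"""

# Default behaviour: the date column is read as text
as_text = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, the date column is converted to datetime values
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])

print(as_text['date'].dtype)
print(as_dates['date'].dtype)
```

Datetime columns support date arithmetic and sorting by calendar order, which is convenient for time series such as stock prices.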


##### Q1. How can you check that all files have the same columns? Write your solution below.

## Step 3: Reading All Files Into Memory

In [15]:
list_ = []
for file_ in files_list:
    df = pd.read_csv(dir_path+file_,index_col=None, header=0)
    list_.append(df)

data= pd.concat(list_, axis = 0, ignore_index = True)      #Concatenating all files 
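
To see what `axis = 0` and `ignore_index = True` do in the call above, here is a small sketch with two made-up frames standing in for two stock files. The frame names and values are illustrative assumptions only.

```python
import pandas as pd

# Two tiny made-up frames standing in for two stock files
first = pd.DataFrame({'date': ['2014-01-27', '2014-01-28'], 'close': [1.0, 2.0]})
second = pd.DataFrame({'date': ['2014-01-27', '2014-01-28'], 'close': [10.0, 20.0]})

# axis=0 stacks the rows; without ignore_index each frame keeps its own row labels
kept = pd.concat([first, second], axis=0)

# ignore_index=True discards the old labels and numbers the rows 0..n-1
renumbered = pd.concat([first, second], axis=0, ignore_index=True)

print(list(kept.index))
print(list(renumbered.index))
```

With `ignore_index = True` every row of the combined frame gets a unique label, which is why the sample indices later in this exercise run into the tens of thousands.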

##### Q2. How many rows and columns does the variable 'data' have? Write your solution below.

Now that we have the data, let's see what it looks like by drawing a random sample of rows from the dataframe.

In [23]:
data.sample(5)

| | date | open | high | low | close | volume | unadjustedVolume | change | changePercent | vwap | label | changeOverTime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16705 | 2014-01-28 | 32.9922 | 33.2517 | 32.8645 | 33.0773 | 8931212 | 8931212 | 0.119135 | 0.361 | 33.0718 | Jan 28, 14 | 0.003614 |
| 28135 | 2014-07-02 | 76.2804 | 76.9518 | 75.5997 | 76.9238 | 3693498 | 3693498 | 0.559517 | 0.733 | 76.5599 | Jul 2, 14 | 0.160881 |
| 298 | 2015-04-02 | 117.0658 | 117.5621 | 116.2793 | 117.3373 | 32220131 | 32220131 | 1.0018 | 0.861 | 116.9157 | Apr 2, 15 | 0.625704 |
| 36305 | 2016-12-19 | 83.7652 | 83.9119 | 82.6835 | 82.8852 | 9674437 | 9674437 | -0.696666 | -0.834 | 82.9784 | Dec 19, 16 | 0.049068 |
| 13333 | 2015-09-01 | 25.4881 | 25.6887 | 25.2237 | 25.3696 | 44143954 | 44143954 | -0.656582 | -2.523 | 25.4381 | Sep 1, 15 | 0.187654 |


##### Q3. What information was lost when the files were combined? Suggest a solution in the cell below.

## Step 4: Exploring the data

In [33]:
#Identifying missing values
data.isnull().sum()

date                0
open                0
high                0
low                 0
close               0
volume              0
unadjustedVolume    0
change              0
changePercent       0
vwap                0
label               0
changeOverTime      0
dtype: int64
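
The chained call above works in two steps: `isnull()` marks each cell as True or False, and `sum()` counts the True values per column. The stock data has no missing values, so this sketch uses a small made-up frame (`toy`, with invented numbers) to show a non-zero result.

```python
import numpy as np
import pandas as pd

# Made-up frame with deliberately missing entries (np.nan)
toy = pd.DataFrame({'open': [1.0, np.nan, 3.0],
                    'close': [1.5, np.nan, np.nan]})

# Step 1: boolean frame, True where a value is missing
mask = toy.isnull()

# Step 2: True counts as 1, so the column sums are the missing-value counts
missing_per_column = mask.sum()
print(missing_per_column)
```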

##### Q4. You have already used some functions, such as
                pandas.DataFrame.head()
                pandas.DataFrame.sample()

##### Now explore the outputs of the following functions
                pandas.DataFrame.tail()
                pandas.DataFrame.info()
                pandas.DataFrame.describe()
            
            


You may also want to see the values of a specific column:

In [None]:
data['date'].head()

How many unique values does a given column contain?

In [None]:
len(data['date'].unique())
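
As a sketch of related options, pandas also provides `Series.nunique()`, which returns the count directly, and `Series.value_counts()`, which shows how often each value occurs. The series below is made up for illustration. Note that `nunique()` ignores missing values by default, while `unique()` includes them.

```python
import pandas as pd

# Made-up series with one repeated date, for illustration only
dates = pd.Series(['2014-01-27', '2014-01-27', '2014-01-28'])

# Length of the array of distinct values
count_via_unique = len(dates.unique())

# Same count in one call (missing values are excluded by default)
count_via_nunique = dates.nunique()

# Occurrences of each distinct value
occurrences = dates.value_counts()

print(count_via_unique, count_via_nunique)
print(occurrences)
```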

##### Q5. How many unique stocks are in the data set?