# Data from External Sources

![external](sampleImages/files.png)

We have already learned about containers and how to populate and manipulate data with them. 

The data for most of the examples shown so far was created within the program (either by populating the containers through loops or by hardcoding). The problem with such programmatically generated data is that it is **temporarily stored in RAM (Random-Access Memory) and is destroyed once the program completes**.

**For real-world applications, data is stored permanently in external storage systems and is therefore not volatile. We use a programming language to read this data from external sources into RAM in order to perform computations on it.**

## External Data Sources

![externaldatasources](sampleImages/externaldatasources.png)

1. Files stored in our computer (local hard drive).
2. Files stored in external hard drives.
3. Files accessible from remote systems via internet.
4. Local and remote database systems.
5. Web services (data available over the Web).

In this lesson we will look at **reading/writing files from local drives. We will also look at some code samples for retrieving data through web services**.

## Files

>Files are basically a **computing resource for recording data on a storage device**. Files are classified based on the purpose of their usage. Typical types of computer files include text files, image files, video files, and audio files.

![file-types](sampleImages/file-types.png)

For this session we will be looking at **three different file types**:

1. **Text file (.txt)**
2. **Comma-Separated Values (CSV) file (.csv)**
3. **JavaScript Object Notation (JSON) file (.json)**

### Text file

A text file generally contains a sequence of lines of electronic text. Open the file *sample_text_file.txt* in the data folder and check its contents.

Reading a text file with Python

In [None]:
sentences = [] #let's initialize an empty list where we will load the sentences. Each line of text will be an element
with open('data/sample_text_file.txt','r') as file: #open() accepts file path as string, and file mode ('r' for read)
    for line in file: #here file is a file object and we are going to loop through it line by line using for loop
        sentences.append(line) #we are saving each line (which is a string) to the list sentences
print ('Total number of lines in the file:',len(sentences))
print (sentences[0])#print first line
print (sentences[-1])#print last line
print (sentences[:3])#print first three lines

Another way to **read a text file in a single shot (not line by line)**

In [None]:
sentences = [] #let's initialize an empty list where we will load the sentences. Each line of text will be an element
with open('data/sample_text_file.txt','r') as file: #open() accepts file path as string, and file mode ('r' for read)
    sentences =file.readlines() #this will read the file in a single shot
print ('Total number of lines in the file:',len(sentences))
print (sentences[0])#print first line
print (sentences[-1])#print last line
print (sentences[:3])#print first three lines
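
Each line read from a file usually keeps its trailing newline character (`'\n'`), which is why printing a line can show an extra blank line after it. Below is a minimal sketch of removing it. The file name `demo_lines.txt` and its contents are made up, and the file is written to a temporary folder only so that the example runs on its own.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained (hypothetical contents)
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'demo_lines.txt')
with open(path, 'w') as file:
    file.write('first line\nsecond line\nthird line\n')

raw_lines = []
clean_lines = []
with open(path, 'r') as file:
    for line in file:
        raw_lines.append(line)                 # keeps the trailing '\n'
        clean_lines.append(line.rstrip('\n'))  # removes only the trailing newline

print(raw_lines)
print(clean_lines)
```

The same `rstrip('\n')` call could be applied to each element returned by `readlines()`.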

Writing to a text file with Python

In text mode you can only write strings to a file. If you need to write other data types, you first need to convert them to strings.

**Let's write a simple message to a file**

In [None]:
message = 'Python is cool'
with open('outputdirectory/simple_message.txt','w') as outFile: #note unlike reading we have to use 'w' here
    outFile.write(message) #outFile is the file object created by open()

Open simple_message.txt in outputdirectory folder to verify whether the string was written to the file.

Now let's write a list of string data to file

In [None]:
sentences = ['I like to watch football.','I love driving.']
#we need to convert the list of sentences to a single string separated by '\n' (which is the newline character)
sentenceString = '\n'.join(sentences) #Now we will have a string 'I like to watch football.\nI love driving.'
with open('outputdirectory/multiple_sentences.txt','w') as file: #note unlike reading we have to use 'w' here
    file.write(sentenceString)
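
Since only strings can be written to a text file, other data types have to be converted first. The sketch below assumes a hypothetical list of numbers and a temporary output file (neither is part of the lesson data). It converts each number with `str()`, writes the result, and reads it back.

```python
import os
import tempfile

scores = [91, 78.5, 64]  # hypothetical numeric data

# Convert every number to a string, then join them with newline characters
scoreString = '\n'.join(str(score) for score in scores)

path = os.path.join(tempfile.mkdtemp(), 'scores.txt')  # throwaway location for the demo
with open(path, 'w') as file:
    file.write(scoreString)

# Read the file back and convert each line from string to float
with open(path, 'r') as file:
    restored = [float(line) for line in file]
print(restored)
```

Note that the values come back as floats because every line was converted with `float()`; the file itself only stores text.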

Now that we have learned to read and write a text file, we will do a relatively complex task: **reading a file and counting the occurrences of each word in the file**.

In [None]:
wordCounts = {}
with open('data/sample_text_file.txt','r') as file: #open() accepts file path as string, and file mode ('r' for read)
    for line in file:
        words = line.split() #split a sentence by white space to get words. 'I am' will become ['I','am']
        for word in words: #now loop through words
            if word in wordCounts: # if word already exists in our dictionary update the count
                wordCounts[word] = wordCounts[word]+1
            else:
                wordCounts[word] = 1 #add new word and update its count as 1
print (wordCounts)
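
The same kind of counting can also be done with `collections.Counter` from the standard library. The sketch below works on a short made-up string instead of the sample file so that it runs on its own. Lowercasing and stripping punctuation are added choices that the loop above does not make.

```python
from collections import Counter
import string

text = 'The cat sat. The dog sat too.\nThe end.'  # made-up stand-in for the file contents

wordCounts = Counter()
for line in text.splitlines():
    for word in line.split():
        # normalise so that 'The' and 'the', or 'sat.' and 'sat', are counted together
        cleaned = word.lower().strip(string.punctuation)
        if cleaned:
            wordCounts[cleaned] += 1

print(wordCounts.most_common(2))  # the two most frequent words with their counts
```

A `Counter` behaves like a dictionary, so `wordCounts['the']` works as before, and a missing word simply returns 0 instead of raising an error.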

### Comma-Separated Values (CSV) File

CSV files are one of the most commonly used file formats for data analysis. **CSV files have a tabular structure and can be opened in Microsoft Excel as a worksheet** (very convenient even for non-programmers). In reality they are **text files that use commas as delimiters**. In the last chapter we already encountered comma-delimited strings and saw how they can be converted into a list using the string method split(','). Open the file us_major_cities.csv and check the content.

As a csv file is a text file with values separated by commas, we can use the same strategy we adopted for reading text files.

Reading a CSV file

In [None]:
data =[] #we will have a list of list
with open('data/us_major_cities.csv','r') as file:
    for line in file:
        words = line.split(',') #if the sentence is 'Cleveland,Ohio', split will convert to ['Cleveland','Ohio']
        data.append(words)
print (data[0]) #print first row of record. This will be a list of values
print (data[1]) #print second row of record.
print ((data[0][-1])) #print last column of first row
print ((data[1][-1])) #print last column of second row 
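
Splitting on commas breaks down when a field itself contains a comma, which CSV allows inside double quotes. Python's built-in `csv` module handles such cases. The sketch below uses made-up in-memory rows (the column names and values are illustrative and not taken from us_major_cities.csv) to compare both approaches.

```python
import csv
import io

# Made-up CSV text; the quoted field in the second row contains a comma
csvText = 'city,state,population\n"Washington, D.C.",DC,1000\nSpringfield,Illinois,2000\n'

# Naive approach: split every line on commas
naive = [line.split(',') for line in csvText.splitlines()]

# csv.reader understands quoting; io.StringIO makes the string behave like a file
parsed = list(csv.reader(io.StringIO(csvText)))

print(naive[1])   # the quoted field is wrongly split into two pieces
print(parsed[1])  # csv.reader keeps it as a single field
```

With a real file, the file object returned by `open()` can be passed to `csv.reader` in place of the `io.StringIO` object.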

Now let's do something more complex. From the us_major_cities.csv file we will find out which city has the highest population. The population is the last column in the csv file. Keep in mind that text read from files is always a string in Python, so the population has to be converted to int before it can be compared.

In [None]:
data = [] #will be a list of lists containing each row
with open('data/us_major_cities.csv','r') as file:
    for line in file:
        words = line.split(',')
        data.append(words)
#Now we have every record in the data list. The first record is the header, so we separate it.
header = data[0] #the header will now hold the header row
records = data[1:] #the records will have all rows other than header
maxPop = int(records[0][-1]) #we are setting maxPop to the population (last column) of the city in the first row
maxIndex = 0 #we also want to know the position of the record so that we can retrieve it later. We set it to first row
counter = 0 #a counter to store the position of current row. Gets updated after each iteration
for record in records:
    population = int(record[-1])#we need to convert string to int
    if population>maxPop:
        maxPop=population #update maxPop to current population
        maxIndex = counter #update maxIndex to current counter value
    counter = counter+1
print ('Maximum population is',maxPop)
print (records[maxIndex]) #use the maxIndex to retrieve the particular record
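
The search for the largest value can also be written with the built-in `max()` function and a `key` function that tells it what to compare. The sketch below uses hypothetical records in the same shape as the rows read above (the city names and numbers are invented for illustration).

```python
# Hypothetical records; as with data read from a file, all values are strings
records = [
    ['Alphaville', 'AA', '1200'],
    ['Betatown', 'BB', '5400'],
    ['Gammaburg', 'CC', '3100'],
]

def population(record):
    # the population is the last column and must be converted from string to int
    return int(record[-1])

largest = max(records, key=population)  # the record with the highest population
print('Maximum population is', population(largest))
print(largest)
```

This avoids the manual counter and index, at the cost of hiding the loop that the longer version makes explicit.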

### JavaScript Object Notation (JSON) File

JSON is a text-based format for representing structured data. It is commonly used for transmitting data in web applications. Python has a module 'json' to parse JSON data as well as to convert Python data to a JSON string. Let's look at a sample file, sample.json, in the data folder. When you inspect sample.json you can see that it closely resembles the dictionary data structure in Python.

Reading a JSON file

For reading a JSON file we read the entire file as a single string and convert it into a Python container using the **loads() function in the json module**.

In [None]:
import json #this is an in-built package for handling JSON data
#we have to read the entire JSON file as a single string (not line by line or not in a list)
dataString = '' #this variable will store the JSON data from the JSON file as a string
with open('data/sample.json','r') as file:
    dataString = file.read() #this will read the entire file to a single string
data = json.loads(dataString) #will convert the JSON string to a Python data type
print (type(data)) #check the type of data. In this case it's a dictionary/dict
#Now we can use key based searching and retrieval as well as other dict methods
print (data['firstName'])  #retrieve the value for the key 'firstName'
print (data['lastName'])  #retrieve the value for the key 'lastName'
print (data['address']['city']) #data['address'] value is a dict and the value of 'city' in that dict is 'San Jone' 
print (data['phoneNumbers'][0]['number'])#data['phoneNumbers'] is a list of dictionaries
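
The json module also provides `load()` (without the trailing s), which parses directly from a file object, so the intermediate string is optional. The sketch below uses a made-up JSON document held in memory rather than sample.json, and also shows how JSON values map to Python types.

```python
import io
import json

# Made-up JSON text standing in for the contents of a file on disk
jsonText = '{"name": "Ada", "languages": ["Python", "C"], "active": true, "manager": null}'

fromString = json.loads(jsonText)             # parse from a string
fromFile = json.load(io.StringIO(jsonText))   # parse from a file-like object

print(fromString == fromFile)  # both approaches give the same dictionary
# JSON arrays become lists, true/false become True/False, and null becomes None
print(type(fromFile['languages']), type(fromFile['active']), fromFile['manager'])
```

With a real file, the object returned by `open()` would be passed to `json.load()` in place of `io.StringIO(jsonText)`.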

Writing to a JSON file

For writing to a JSON file we need to convert the container which holds the data to a JSON string using the **dumps() function in the json module.**

In [None]:
import json
#suppose we have a dictionary 
fruits = {'Apple':56,'Cherry':44,'Pappaya':33,'Grape':55}
fruitsJSONString = json.dumps(fruits) #convert the dict to a JSON string
with open('outputdirectory/fruits.json','w') as file:
    file.write(fruitsJSONString)
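
As an alternative, `json.dump()` (without the trailing s) writes straight to a file object, and its `indent` parameter produces a more readable file. The sketch below writes to a temporary folder (an assumption made so the example does not depend on outputdirectory) and reads the file back to confirm that nothing was lost.

```python
import json
import os
import tempfile

fruits = {'Apple': 56, 'Cherry': 44, 'Grape': 55}
path = os.path.join(tempfile.mkdtemp(), 'fruits.json')  # throwaway location for the demo

with open(path, 'w') as file:
    json.dump(fruits, file, indent=2)  # indent=2 puts each key on its own line

with open(path, 'r') as file:
    restored = json.load(file)  # read the file back into a dict

print(restored == fruits)
```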

## Data from Web services

>**A web service can be defined as a set of protocols or standards which are used for exchanging data between applications or devices over the internet**. 

This enables software written in completely different languages to interact easily. The data exchange format could be XML (Extensible Markup Language) or JSON. We will look at an example where we retrieve JSON data from a web service URL.

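Reading data from a web service URL. We will be using a **third-party Python package called *requests*, which supports URL requests** (it is not part of the standard library and may need to be installed first, for example with `pip install requests`). Our browsers perform such requests when we move from one page to another.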

In [None]:
import requests #a package for performing http requests
url = 'https://randomuser.me/api/'  #this is a url just like https://www.google.com
response=requests.get(url) #get the response of calling this url. We can actually see the output through a browser
data = response.json() #this will convert the response text to a Python dictionary
print (type(data)) #check its data type to confirm
print (data['results'][0]['location']['country']) #data['results'] is a list of dictionaries

Now let's do a more complex example where we request a set of 5000 users and check the distribution of countries (frequency of occurrence).

In [39]:
import requests #a package for performing http requests
url = 'https://randomuser.me/api/?results=5000'  #this is a url just like https://www.google.com
response=requests.get(url) #get the response of calling this url. We can actually see the output through a browser
data = response.json() #this will convert the response text to a Python dictionary
countryCount = {} #for storing the number of occurrences of different countries
#now we will loop through each of the data['results'] which is a list of dictionaries
for result in data['results']:
    country = result['location']['country'] #get the country from this result
    if country in countryCount: #if country exists then update its count
        countryCount[country] = countryCount[country]+1
    else:
        countryCount[country] = 1 #add new country and set its value to 1
print (countryCount)

{'Switzerland': 324, 'New Zealand': 269, 'Spain': 301, 'Norway': 298, 'United Kingdom': 327, 'Brazil': 310, 'France': 291, 'Finland': 294, 'Netherlands': 298, 'Denmark': 285, 'Germany': 288, 'United States': 278, 'Turkey': 314, 'Australia': 265, 'Iran': 292, 'Ireland': 276, 'Canada': 290}
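
Raw counts are easier to compare when they are sorted and expressed as percentages. The sketch below defines a hypothetical helper, `country_shares`, that expects a list in the same shape as `data['results']`. It is run on a tiny hand-made sample (not real API output) so that it works without a network connection.

```python
from collections import Counter

def country_shares(results):
    """Return (country, count, percentage) tuples sorted by count, largest first."""
    counts = Counter(result['location']['country'] for result in results)
    total = sum(counts.values())
    return [(country, count, round(100 * count / total, 1))
            for country, count in counts.most_common()]

# Tiny hand-made sample in the same shape as data['results'] (not real API output)
sample = [
    {'location': {'country': 'Norway'}},
    {'location': {'country': 'Spain'}},
    {'location': {'country': 'Norway'}},
    {'location': {'country': 'Norway'}},
]
for country, count, share in country_shares(sample):
    print(country, count, share)
```

Calling `country_shares(data['results'])` on the response above should give the same kind of summary for all 5000 users.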


**Reading text data from Web services**

In [None]:
import requests
data = requests.get('http://www.gutenberg.org/cache/epub/98/pg98.txt') #get the text data from web
textData = data.text #this will store the text data from response
print (textData)