----
**Author**: Gunnvant

**Description**: Working with flat files

**Audience**: Beginner

**Prerequisites**: Python 101

---





## TOC
- Using open() to manipulate text files is not very scalable
- csv reader: Read and easily manipulate flat files
- json reader: Read and manipulate json files


In a couple of previous examples we used the built-in `open()` function to handle text files, and we saw that extracting a value from a row was not a trivial exercise. Recall the class problem on the Twitter dataset, where the following piece of code was written:

```python
day = '26'
cnt = 0
for tweet in tweets_list[1:]:
    date = tweet.split("|")[0]  # manual split: you must know the delimiter and the position of the date column in each row
    date = date.replace('"',"")
    if date.startswith(date_mapping[day]):
        cnt+=1
```

Many of these low-level details are handled by a Python module called [`csv`](https://docs.python.org/3/library/csv.html). We will use this module to see how it can make our life easier. Also, if you appear for data engineering or data science interviews, you can expect first-principles questions on your ability to work with simple flat files.
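To see the difference on a small scale, here is a minimal sketch that parses the same pipe-delimited text both ways. The sample text is made up for illustration and is not taken from the tweets file.

```python
import csv
import io

# Hypothetical sample: a header and one row, pipe-delimited, with quoted fields.
# The second field contains the delimiter inside the quotes.
sample = 'Tweet_Date|Tweet_Text|Tweet_Handle\n"Thu Dec 26"|"hello | world"|"someone"\n'

# Manual approach: split every line on the delimiter
manual_rows = [line.split("|") for line in sample.splitlines()]

# csv approach: the reader handles both the delimiter and the quoting
csv_rows = list(csv.reader(io.StringIO(sample), delimiter="|"))

print(manual_rows[1])  # quotes are still there and the text field is cut in two
print(csv_rows[1])     # three clean fields
```

The manual split breaks the quoted text field at the embedded `|`, while `csv.reader` keeps it as one field and removes the surrounding quotes.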

In [1]:
import csv 
cnt = 0
with open("../../data/tweets_assignment.txt","r") as f:
    reader = csv.reader(f,delimiter="|")
    for row in reader: ## there was no need to do an explicit split based on |
        print(row) ## The quoted text seems to have been automatically handled
        cnt+=1
        if cnt==10:
            break

['Tweet_Date', 'Tweet_Text', 'Tweet_Handle']
['Thu Dec 26 123415 0000 2013', 'ashdubey ratigirl FirstpostBiz First economic move by AamAadmiParty  How come it has gone down in Guj', 'maverickenator']
['Thu Dec 26 123413 0000 2013', 'RT anandpassion Growing popularity can seen as AAP websites Rank 275 in India highest ever for a political party JoinAAP ankitlal Aam', 'aapkask']
['Thu Dec 26 123402 0000 2013', 'RT anandpassion Growing popularity can seen as AAP websites Rank 275 in India highest ever for a political party JoinAAP ankitlal Aam', 'arpshalder']
['Thu Dec 26 123354 0000 2013', 'ArvindKejriwal AapYogendra AamAadmiParty I hope your party realizes why the likes of alkalamba are trying to join AAP now Be careful', 'GargaC']
['Thu Dec 26 123351 0000 2013', 'RT ssttuuttii Press Conference AamAadmiParty Application form for candidature for LS2014 is being uploaded at website AAP4India http', 'kejriwalfanclub']
['Thu Dec 26 123346 0000 2013', 'anandpassion AamAadmiParty ankitlal Why

In [2]:
## Let's see if the csv reader also handles the quote characters
cnt = 0
with open("../../data/tweets_assignment.txt","r") as f:
    reader = csv.reader(f,delimiter = "|")
    for row in reader:
        print(row[0].startswith("Thu Dec 26")) ## This confirms that the csv module will automatically take care of many details for us
        cnt+=1
        if cnt==12:
            break
    

False
True
True
True
True
True
True
True
True
True
True
True
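
The first value is `False` because the first row the reader yields is the header (`Tweet_Date`, ...), not a tweet. One way to skip it is to call `next()` on the reader before looping. The sketch below uses made-up in-memory data rather than the real file.

```python
import csv
import io

# Hypothetical two-line sample standing in for the tweets file
sample = 'Tweet_Date|Tweet_Text\n"Thu Dec 26 123415 0000 2013"|"some text"\n'

reader = csv.reader(io.StringIO(sample), delimiter="|")
header = next(reader)  # consume the header row so the loop sees only data rows
flags = [row[0].startswith("Thu Dec 26") for row in reader]

print(header)
print(flags)
```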


In [3]:
## There is another reader called DictReader
cnt = 0
with open("../../data/tweets_assignment.txt","r") as f:
    reader = csv.DictReader(f,delimiter="|")
    for row in reader:
        print(row['Tweet_Date']) ## this is now even simpler, as column values can be accessed by column name, like dictionary key-value pairs
        cnt+=1
        if cnt==10:
            break

Thu Dec 26 123415 0000 2013
Thu Dec 26 123413 0000 2013
Thu Dec 26 123402 0000 2013
Thu Dec 26 123354 0000 2013
Thu Dec 26 123351 0000 2013
Thu Dec 26 123346 0000 2013
Thu Dec 26 123343 0000 2013
Thu Dec 26 123341 0000 2013
Thu Dec 26 123333 0000 2013
Thu Dec 26 123328 0000 2013


In [4]:
cnt = 0
with open("../../data/tweets_assignment.txt","r") as f:
    reader = csv.DictReader(f,delimiter="|")
    for row in reader:
        print(row['Tweet_Date'].startswith("Thu Dec 26"))
        cnt+=1
        if cnt==10:
            break

True
True
True
True
True
True
True
True
True
True


### Class Exercise (writer and DictWriter)

Use `tweets_assignment.txt` to find out how many tweets were posted on 23, 24, 25 and 26 December. Write the results to a csv file using the `writer` or `DictWriter` class. The file should have the following format:

```csv
count_23,count_24,count_25,count_26
val1,val2,val3,val4
```
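
Neither writer class has been shown so far, so here is a minimal sketch of both on made-up data. The column names `name` and `score` are hypothetical and unrelated to the exercise. It writes to an in-memory buffer; when writing to a real file, the `csv` documentation recommends opening it with `newline=""`.

```python
import csv
import io

# csv.writer takes each row as a list (or any sequence)
buf1 = io.StringIO()
writer = csv.writer(buf1)
writer.writerow(["name", "score"])
writer.writerow(["a", 1])

# csv.DictWriter takes each row as a dict keyed by the field names
buf2 = io.StringIO()
dict_writer = csv.DictWriter(buf2, fieldnames=["name", "score"])
dict_writer.writeheader()
dict_writer.writerow({"name": "a", "score": 1})

# Both approaches produce the same csv text
print(buf1.getvalue() == buf2.getvalue())
print(buf1.getvalue().splitlines())
```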

## JSON Data

Another very popular data format is JSON. It is nowadays the preferred way of transferring data between web applications. As a data analyst, data scientist or data engineer you will encounter this form of data more often than not. We will use the dataset named `sample_json.json` to see the general structure of JSON data. Use this [link](http://jsonviewer.stack.hu/) to understand `sample_json.json` better.
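
As a rough picture of what parsing does, the sketch below loads a small JSON string. The text is a made-up, much smaller stand-in for `sample_json.json`, and the variable names are only for illustration.

```python
import json

# Hypothetical JSON text; the real sample_json.json is much larger
text = '{"type": "FeatureCollection", "features": [{"id": "x1", "properties": {"mag": 2.1, "tz": null}}]}'

# JSON objects become dicts, arrays become lists, null becomes None
sample = json.loads(text)

first = sample["features"][0]
print(type(sample).__name__, type(sample["features"]).__name__)
print(first["properties"]["mag"], first["properties"]["tz"])
```

Once parsed, the data is navigated with ordinary dictionary keys and list indexes.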

In [5]:
import json
with open("../../data/sample_json.json","r",encoding="utf-8") as f:
    data = json.load(f)  ## json.load reads and parses directly from the file object

In [7]:
data.keys()

dict_keys(['type', 'metadata', 'features', 'bbox'])

In [8]:
data['features'][0]

{'type': 'Feature',
 'properties': {'mag': 2.1500001,
  'place': '3 km ENE of Pāhala, Hawaii',
  'time': 1626859250480,
  'updated': 1626859447300,
  'tz': None,
  'url': 'https://earthquake.usgs.gov/earthquakes/eventpage/hv72595567',
  'detail': 'https://earthquake.usgs.gov/earthquakes/feed/v1.0/detail/hv72595567.geojson',
  'felt': None,
  'cdi': None,
  'mmi': None,
  'alert': None,
  'status': 'automatic',
  'tsunami': 0,
  'sig': 71,
  'net': 'hv',
  'code': '72595567',
  'ids': ',hv72595567,',
  'sources': ',hv,',
  'types': ',origin,phase-data,',
  'nst': 44,
  'dmin': None,
  'rms': 0.100000001,
  'gap': 113,
  'magType': 'md',
  'type': 'earthquake',
  'title': 'M 2.2 - 3 km ENE of Pāhala, Hawaii'},
 'geometry': {'type': 'Point',
  'coordinates': [-155.451171875, 19.2175006866455, 35.3600006103516]},
 'id': 'hv72595567'}

## Class Exercise:

Extract the information about each earthquake event's:

1. Magnitude 
2. Location
3. Url

Create three lists with information on the three parameters listed above. Then write this information in a csv file.

In [7]:
magnitude = [f['properties']['mag'] for f in data['features']]
location = [f['properties']['place'] for f in data['features']]
url = [f['properties']['url'] for f in data['features']]

In [8]:
with open("../../data/json_data.csv","w",encoding='utf-8',newline='') as f: ## newline='' avoids blank lines between rows on Windows
    fieldnames = ['magnitude','location','url']
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for mag,loc,u in zip(magnitude,location,url):
        writer.writerow({'magnitude': mag, 'location': loc,'url':u})
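
One thing to keep in mind when reading such a file back: the `csv` module returns every value as a string, so numbers need an explicit conversion. A minimal sketch, using made-up text with the same header as `json_data.csv` instead of the real file:

```python
import csv
import io

# Hypothetical csv text with the same columns as json_data.csv
text = (
    "magnitude,location,url\n"
    "2.15,Somewhere,http://example.com/a\n"
    "1.2,Elsewhere,http://example.com/b\n"
)

rows = list(csv.DictReader(io.StringIO(text)))
print(type(rows[0]["magnitude"]).__name__)  # values come back as strings

# Convert to float before comparing or doing arithmetic
magnitudes = [float(row["magnitude"]) for row in rows]
print(magnitudes)
```

Note that a `None` value is written by the csv writer as an empty string, which `float()` would reject, so rows with a missing magnitude may need separate handling.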

## Class Assignment

Use the same dataset as above and now filter out the places where the magnitude was more than 1.8. You can use either the json version or the csv you created in the above exercise.