# <font color = 'dodgerblue'>**Handling Text Data**

Objective: Learn to import text data into a Pandas DataFrame. Text data often comes from online sources in compressed form (for example zip or tar archives). We need to know how to download such data and convert it into a DataFrame so that we can start pre-processing it.

**Typical Steps**
- Step1: use !wget to download data files from URL
- Step2: check content of folder where data was downloaded
- Step3: Check content of zipped/tar folder
- Step 4: unzip/untar files
- Step 4a: Understand the structure of unzipped folder/files (only required if data is stored in multiple folders and files).
- Step 5: Check content of the main file using !head command.
- Step 6a: combine data from multiple files (if data is stored in multiple files).
- Step6: Create a dataframe



We will use the pandas library to read the downloaded files into DataFrames, together with `pathlib`, `zipfile`, and `tarfile` from the standard library. Let us import them.

# <font color = 'dodgerblue'>**Import libraries**

In [None]:
# import libraries
import pandas as pd
from pathlib import Path
import zipfile
import tarfile

# <font color = 'dodgerblue'>**Mount Google drive and Specify folder paths**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Make sure you change the path to where you want to save the data.
# In the code below, data is the folder name in my Google Drive;
# you can change this to the appropriate folder for your drive.
# For example, you may want to save data to BUAN6341/HW1/Data.
# In that case the code below should be changed to: '/content/drive/MyDrive/BUAN6341/HW1/Data'

base_path = '/content/drive/MyDrive/data'

In [None]:
# create a POSIX path for data folder
# we can use this to navigate file system
base_folder = Path(base_path)

In [None]:
# I usually keep the compressed files in the archive folder and unzip them into the data folder (data_folder)
# You can skip this step if you do not want to follow this folder structure

# The / operator joins several paths, or a mix of paths and strings, as long as at least one
# of them is an instance of class `Path` from the `pathlib` library (as shown below).

archive_folder = base_folder/'archive'
data_folder = base_folder/'datasets'
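
The folders above are only path objects; nothing is created on disk until we ask for it. A minimal sketch, assuming a temporary directory stands in for the Google Drive folder (the `tempfile` base is an assumption for illustration), shows the `/` operator and how `mkdir` can create the folders before we download into them:

```python
import tempfile
from pathlib import Path

# A temporary directory stands in for '/content/drive/MyDrive/data' (assumption for this sketch).
base_folder = Path(tempfile.mkdtemp())

# The / operator joins a Path with strings (or other Paths).
archive_folder = base_folder / 'archive'
data_folder = base_folder / 'datasets'

# parents=True creates missing parent folders; exist_ok=True makes the call safe to repeat.
for folder in (archive_folder, data_folder):
    folder.mkdir(parents=True, exist_ok=True)

print(archive_folder.name, archive_folder.is_dir())
print(data_folder.name, data_folder.is_dir())
```

Creating the folders up front matters because a download that writes to a file inside a missing folder will fail.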

In [None]:
# check current working directory
Path.cwd()

PosixPath('/content')

# <font color = 'dodgerblue'>**Task 1: Get Data from a CSV file**

Task : Read data from a csv file. <br>
We can download the data from following source: Sentiment 140 dataset:
http://help.sentiment140.com/for-students

Url for data: http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip

Data Columns:
- 0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
- 1 - the id of the tweet (2087)
- 2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
- 3 - the query (lyx). If there is no query, then this value is NO_QUERY.
- 4 - the user that tweeted (robotickilldozr)
- 5 - the text of the tweet (Lyx is cool)

## <font color = 'dodgerblue'>**Step1: use wget to download data files from URL**

Download a file to the filesystem from a URL using the wget command. <br>
URL = http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip <br>

**Syntax** <br> !wget {url} -P {path_to_save_file} -O {filename} <br>
- To use Python variables in shell commands, we wrap them in {} brackets.
- -P sets the directory where files are saved; if we do not specify -P, files are saved in the current directory.
- -O writes the download to the given file name and overwrites that file if it already exists. When -O is given, wget writes to exactly that path, so -P does not affect where the file lands.
<br>

Alternatively we can also use <br>
!wget url -P 'path_to_save_file'
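
If `wget` is not available, the same download can be done from Python. The following is a sketch using `urllib.request.urlretrieve` from the standard library; the helper name `download_file` is hypothetical, and a local `file://` URI stands in for the real URL so that the example runs without a network connection.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

def download_file(url, dest):
    """Download url to dest, creating the parent folder if needed (hypothetical helper)."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # urlretrieve overwrites dest if it already exists, similar to wget -O.
    urlretrieve(url, str(dest))
    return dest

# Stand-in source: a small local file exposed as a file:// URI (assumption for this sketch).
tmp = Path(tempfile.mkdtemp())
source = tmp / 'source.txt'
source.write_text('hello\n', encoding='utf-8')

saved = download_file(source.as_uri(), tmp / 'archive' / 'copy.txt')
print(saved.read_text(encoding='utf-8'))
```

In Colab, the same helper could be called with the HTTP URL and a destination such as `archive_folder/'trainingandtestdata.zip'`.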

In [None]:
# use wget to download the data
file = archive_folder/'trainingandtestdata.zip'
URL = 'http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip'
!wget {URL} -P {archive_folder} -O {file}

--2022-08-29 03:21:00--  http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
Resolving cs.stanford.edu (cs.stanford.edu)... 171.64.64.64
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip [following]
--2022-08-29 03:21:00--  https://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip
Connecting to cs.stanford.edu (cs.stanford.edu)|171.64.64.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 81363704 (78M) [application/zip]
Saving to: ‘/content/drive/MyDrive/data/archive/trainingandtestdata.zip’


2022-08-29 03:21:06 (15.0 MB/s) - ‘/content/drive/MyDrive/data/archive/trainingandtestdata.zip’ saved [81363704/81363704]



## <font color = 'dodgerblue'>**Step2: check content of folder where data was downloaded**

In [None]:
# step 2: check the content of the folder where the file was downloaded
# archive_folder is a pathlib Path. We can iterate over its entries using .iterdir()
# and print only the entries whose name contains 'zip'
for entries in archive_folder.iterdir():
  if 'zip' in entries.name:
    print(entries.name)

ml-latest-small.zip
COVID-19.zip
tweeteval-main.zip
Reviews.csv.zip
Malware_df.csv.zip
News_Category_Dataset_v2.json.zip
mashqa_data.zip
bike-sharing-demand.zip
OpinRank-master.zip
trainingandtestdata.zip
master.zip


## <font color = 'dodgerblue'>**Step3: Check content of zipped/tar folder**

We can construct a path to the file by joining the parts using the special operator /. The / operator can join several paths, or a mix of paths and strings, as long as at least one of them is an instance of class `Path` from the `pathlib` library (as shown below).


In [None]:
# path for zipfile
file = archive_folder / 'trainingandtestdata.zip'

We will open and read the zip file using 
```python
with zipfile.ZipFile(file, mode)
``` 
Here `file` is the file to open, and `mode` specifies how to open it (read, write, etc.).
In the command below we use `'r'` to open the file in read mode.
Finally, we use the `namelist()` method to list the contents of the zipped folder.

In [None]:
# list the content of the zipped folder
with zipfile.ZipFile(file, 'r') as f:
  print(f.namelist())

['testdata.manual.2009.06.14.csv', 'training.1600000.processed.noemoticon.csv']
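
`namelist()` returns only names. When we also want file sizes before extracting, `infolist()` gives one `ZipInfo` object per member. A small sketch, using a tiny zip file created on the fly (the file names and content are made up for illustration):

```python
import tempfile
import zipfile
from pathlib import Path

# Build a tiny zip archive to inspect (made-up content for this sketch).
tmp = Path(tempfile.mkdtemp())
zip_path = tmp / 'example.zip'
with zipfile.ZipFile(zip_path, 'w', compression=zipfile.ZIP_DEFLATED) as f:
    f.writestr('train.csv', 'a,b\n1,2\n' * 100)
    f.writestr('test.csv', 'a,b\n3,4\n')

# List member names, then uncompressed and compressed sizes in bytes.
with zipfile.ZipFile(zip_path, 'r') as f:
    names = f.namelist()
    sizes = {info.filename: (info.file_size, info.compress_size) for info in f.infolist()}

print(names)
for name, (size, compressed) in sizes.items():
    print(name, size, compressed)
```

Checking `file_size` first is a cheap way to see whether an archive will fit in the available disk space.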


## <font color = 'dodgerblue'>**Step 4: unzip/untar files**

We will open the file using
```python
with zipfile.ZipFile(file, mode) as f:
```
Then we will use `f.extract` to extract a particular file from the zipped folder
```python
f.extract(file_to_extract, path)
```
In the above command, `file_to_extract` is the name of the file inside the zipped folder that we want to extract, and `path` is the folder where we want to save the extracted file.
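
`extract` also returns the path of the file it created, which is handy for later steps. A minimal sketch on a made-up archive (file names and content are assumptions for illustration):

```python
import tempfile
import zipfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())
zip_path = tmp / 'example.zip'
out_dir = tmp / 'datasets'

# Made-up archive with two members.
with zipfile.ZipFile(zip_path, 'w') as f:
    f.writestr('train.csv', 'x,y\n1,2\n')
    f.writestr('test.csv', 'x,y\n3,4\n')

with zipfile.ZipFile(zip_path, 'r') as f:
    # extract() creates out_dir if needed and returns the path of the extracted file.
    extracted = Path(f.extract('train.csv', path=out_dir))

print(extracted.name)
print(sorted(p.name for p in out_dir.iterdir()))
```

Only the requested member is written; the other file stays inside the archive.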

In [None]:
%%timeit
# %%timeit reports the time it takes to execute this cell.
# A cell magic such as %%timeit must be the first line of the cell.
# unzip the file

file = archive_folder / 'trainingandtestdata.zip'
with zipfile.ZipFile(file, 'r') as f:
  f.extract('training.1600000.processed.noemoticon.csv', path = data_folder)

3.85 s ± 347 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
# we can also use the shell command !unzip
# Syntax: !unzip {file} -d {location}
# here file is the file to unzip and location is where we want to extract the files
# if we do not specify the -d option, files will be saved in the current directory
# you can uncomment the lines below if you want to compare zipfile and !unzip

# %%timeit
# file = archive_folder / 'trainingandtestdata.zip'
# !unzip -o {str(file)} -d {str(data_folder)} 


## <font color = 'dodgerblue'>**Step 5: Check content of the main file using head command**

In [None]:
# before importing the file, let us see what it looks like
file_csv = data_folder / 'training.1600000.processed.noemoticon.csv'
# the shell command head shows the first lines of the file
# We pass the path to the file as a string, using Python's str function to convert the Path
# We can specify the number of lines to print using the option -n.
!head -n 5 {str(file_csv)}

"0","1467810369","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","_TheSpecialOne_","@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D"
"0","1467810672","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","scotthamilton","is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!"
"0","1467810917","Mon Apr 06 22:19:53 PDT 2009","NO_QUERY","mattycus","@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds"
"0","1467811184","Mon Apr 06 22:19:57 PDT 2009","NO_QUERY","ElleCTF","my whole body feels itchy and like its on fire "
"0","1467811193","Mon Apr 06 22:19:57 PDT 2009","NO_QUERY","Karoli","@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there. "


## <font color = 'dodgerblue'>**Step6: Create pandas dataframe**

In [None]:
# Create a pandas DataFrame by passing the file path and other arguments:
# pd.read_csv(<file_path>, names=<list of column names>, header=None, encoding=<encoding>)
# The file has no header row, so we pass header=None and supply column names with names=.
# The default delimiter for read_csv is ',' so we do not need to set it for CSV files.
# The encoding argument names the file's text encoding, e.g. 'UTF-8' or 'ISO-8859-1'.
# It is difficult to detect the encoding of a file from an external source.
# The default is UTF-8; if reading fails with a UnicodeDecodeError,
# try another encoding such as 'ISO-8859-1', as we do here.

df = pd.read_csv(file_csv, names = ['polarity', 'id', 'date', 'query', 'user', 'text'],
                 encoding = 'ISO-8859-1', header = None)
df

Unnamed: 0,polarity,id,date,query,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
...,...,...,...,...,...,...
1599995,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599996,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599997,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599998,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...
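
One way to choose the `encoding` argument is to try UTF-8 first and fall back to ISO-8859-1 only if decoding fails. The sketch below does this with the standard library on made-up files; the helper name `detect_encoding` is hypothetical, and it only picks between the listed candidates rather than detecting encodings in general.

```python
import tempfile
from pathlib import Path

def detect_encoding(path, candidates=('utf-8', 'ISO-8859-1')):
    """Return the first candidate encoding that decodes the file without error."""
    # Reads the whole file into memory; for very large files, test a sample instead.
    raw = Path(path).read_bytes()
    for encoding in candidates:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings worked')

tmp = Path(tempfile.mkdtemp())
latin_file = tmp / 'latin.csv'
# The single byte 0xE9 (e with acute accent in ISO-8859-1) is not valid UTF-8 here.
latin_file.write_bytes('"0","caf\xe9"\n'.encode('ISO-8859-1'))
utf8_file = tmp / 'utf8.csv'
utf8_file.write_bytes('"0","caf\xe9"\n'.encode('utf-8'))

print(detect_encoding(latin_file))
print(detect_encoding(utf8_file))
```

The returned name can be passed as the `encoding` argument of `pd.read_csv`. Because ISO-8859-1 accepts any byte sequence, a successful decode with it does not prove the file was written in that encoding.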


# <font color = 'dodgerblue'>**Task 2: Get Data from a text file** 
Task : Read data from a txt file. <br>
We can download the data from following source: https://norvig.com/big.txt

## <font color = 'dodgerblue'>**Step1: use wget to download data files from URL**

In [None]:
# Download a file to the filesystem from a URL using the wget command
# syntax: !wget {url} -P {path_to_save_file}
# if we do not specify -P, files will be saved in the current directory
file = data_folder/'big.txt'
url = "https://norvig.com/big.txt"
!wget {url} -P {data_folder} -O {file}

--2022-08-29 03:22:27--  https://norvig.com/big.txt
Resolving norvig.com (norvig.com)... 158.106.138.13
Connecting to norvig.com (norvig.com)|158.106.138.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6488666 (6.2M) [text/plain]
Saving to: ‘/content/drive/MyDrive/data/datasets/big.txt’


2022-08-29 03:22:31 (1.98 MB/s) - ‘/content/drive/MyDrive/data/datasets/big.txt’ saved [6488666/6488666]



## <font color = 'dodgerblue'>**Step2: Check content of folder where data was downloaded**

In [None]:
# List the content of  directory
for entries in data_folder.iterdir():
  if 'txt' in entries.name: 
    print(entries.name)

labels.txt
reviews.txt
wiki.10K.txt
big.txt
txt_sentoken


## <font color = 'dodgerblue'>**Step 5: Check content of the main file using head command**

In [None]:
# Look at the first 10 lines of the .txt file (10 lines is the default for head)
file_text = data_folder / 'big.txt'
!head {str(file_text)}

The Project Gutenberg EBook of The Adventures of Sherlock Holmes
by Sir Arthur Conan Doyle
(#15 in our series by Sir Arthur Conan Doyle)

Copyright laws are changing all over the world. Be sure to check the
copyright laws for your country before downloading or redistributing
this or any other Project Gutenberg eBook.

This header should be the first thing seen when viewing this Project
Gutenberg file.  Please do not remove it.  Do not change or edit the
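
The same peek can be done in pure Python, which is useful outside a notebook where `!head` is unavailable. A sketch using `itertools.islice` to read only the first few lines (the helper name `head` and the sample file are hypothetical):

```python
import tempfile
from itertools import islice
from pathlib import Path

def head(path, n=10, encoding='utf-8'):
    """Return the first n lines of a text file without reading the whole file."""
    with open(path, 'r', encoding=encoding) as f:
        return [line.rstrip('\n') for line in islice(f, n)]

# Made-up sample file with 100 short lines.
sample = Path(tempfile.mkdtemp()) / 'sample.txt'
sample.write_text(''.join(f'line {i}\n' for i in range(100)), encoding='utf-8')

for line in head(sample, n=3):
    print(line)
```

Because the file object is iterated lazily, only the requested lines are read, which keeps this cheap even for large files.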


## <font color = 'dodgerblue'>**Use open to retrieve data**

In [None]:
# Open and read the file using the open(filename, 'r') function
# We use `with open(...)` because it closes the file automatically when the block ends
# This is the recommended way to open a text file

with open(file_text, 'r') as file:
  text = file.read()

# The text variable now contains the contents of the text file.
text[0:100]

'The Project Gutenberg EBook of The Adventures of Sherlock Holmes\nby Sir Arthur Conan Doyle\n(#15 in o'
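
When only the text is needed, `Path.read_text` opens, reads, and closes the file in one call, and passing `encoding` explicitly avoids depending on the platform default. A short sketch on a made-up file standing in for big.txt:

```python
import tempfile
from pathlib import Path

# Made-up text file standing in for big.txt.
file_text = Path(tempfile.mkdtemp()) / 'sample.txt'
file_text.write_text('first line\nsecond line\nthird line\n', encoding='utf-8')

# Read the whole file as one string.
text = file_text.read_text(encoding='utf-8')

# Split into lines when line-level processing is needed.
lines = text.splitlines()

print(text[0:10])
print(len(lines))
```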

# <font color = 'dodgerblue'>**Task 3: Get Data from a JSON File**

## <font color = 'dodgerblue'>**Example 1 - small file**
Task : Read data from a JSON file. <br>
We can download the data from following source: https://filesamples.com/samples/code/json/sample4.json

### <font color = 'dodgerblue'>**Step1: use wget to download data files from URL**

In [None]:
# Download a file to the filesystem from a URL using the wget command
# syntax: !wget {url} -P {path_to_save_file}
# if we do not specify -P, files will be saved in the current directory

url = 'https://filesamples.com/samples/code/json/sample4.json'
file = data_folder/'sample4.json'
# we can specify variables using {}
!wget {url} -P {data_folder} -O {file}

--2022-08-29 03:22:52--  https://filesamples.com/samples/code/json/sample4.json
Resolving filesamples.com (filesamples.com)... 104.21.17.252, 172.67.178.244, 2606:4700:3035::ac43:b2f4, ...
Connecting to filesamples.com (filesamples.com)|104.21.17.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 451 [application/json]
Saving to: ‘/content/drive/MyDrive/data/datasets/sample4.json’


2022-08-29 03:22:52 (2.09 MB/s) - ‘/content/drive/MyDrive/data/datasets/sample4.json’ saved [451/451]



### <font color = 'dodgerblue'>**Step2: Check content of folder where data was downloaded**

In [None]:
# check content of basepath
for entries in data_folder.iterdir():
  if 'json' in entries.name:
    print(entries.name)

sample4.json
News_Category_Dataset_v2.json


### <font color = 'dodgerblue'>**Step 5: Check content of main file using head command**

In [None]:
# let us see what the file looks like
file_json1 = data_folder / 'sample4.json'
!head {file_json1}

{
  "people" : [
    {
       "firstName": "Joe",
       "lastName": "Jackson",
       "gender": "male",
       "age": 28,
       "number": "7349282382"
    },
    {


### <font color = 'dodgerblue'>**Step6: Create pandas dataframe**

In [None]:
# read json file using pd.read_json which returns pandas DataFrame object
json = pd.read_json(file_json1)
json

Unnamed: 0,people
0,"{'firstName': 'Joe', 'lastName': 'Jackson', 'g..."
1,"{'firstName': 'James', 'lastName': 'Smith', 'g..."
2,"{'firstName': 'Emily', 'lastName': 'Jones', 'g..."


In [None]:
# check the column 
json.people

0    {'firstName': 'Joe', 'lastName': 'Jackson', 'g...
1    {'firstName': 'James', 'lastName': 'Smith', 'g...
2    {'firstName': 'Emily', 'lastName': 'Jones', 'g...
Name: people, dtype: object

In [None]:
# iterate over rows
for row in json.people:
  print(row)

{'firstName': 'Joe', 'lastName': 'Jackson', 'gender': 'male', 'age': 28, 'number': '7349282382'}
{'firstName': 'James', 'lastName': 'Smith', 'gender': 'male', 'age': 32, 'number': '5678568567'}
{'firstName': 'Emily', 'lastName': 'Jones', 'gender': 'female', 'age': 24, 'number': '456754675'}


In [None]:
# we can easily create a dataframe from list of dictionaries
# first create list of dictionaries using list comprehension
list_of_dict = [row for row in json.people]
list_of_dict

[{'firstName': 'Joe',
  'lastName': 'Jackson',
  'gender': 'male',
  'age': 28,
  'number': '7349282382'},
 {'firstName': 'James',
  'lastName': 'Smith',
  'gender': 'male',
  'age': 32,
  'number': '5678568567'},
 {'firstName': 'Emily',
  'lastName': 'Jones',
  'gender': 'female',
  'age': 24,
  'number': '456754675'}]

In [None]:
# now we can create a dataframe from the list of dictionaries
df_json = pd.DataFrame(list_of_dict)
df_json

Unnamed: 0,firstName,lastName,gender,age,number
0,Joe,Jackson,male,28,7349282382
1,James,Smith,male,32,5678568567
2,Emily,Jones,female,24,456754675


In [None]:
# we can do this in one step as follows
df_json = pd.DataFrame([row for row in json.people])

In [None]:
df_json

Unnamed: 0,firstName,lastName,gender,age,number
0,Joe,Jackson,male,28,7349282382
1,James,Smith,male,32,5678568567
2,Emily,Jones,female,24,456754675
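
Another route to the same DataFrame is to parse the file with the standard `json` module and pick out the `people` list directly, since `pd.DataFrame` accepts a list of dictionaries. A sketch using an inline string that mirrors the structure of sample4.json (only the parsing step is run here; the final `pd.DataFrame(records)` call is left as a comment):

```python
import json as json_module  # aliased because the notebook uses `json` as a variable name

# Inline text mirroring the structure of sample4.json shown above.
raw = '''
{
  "people": [
    {"firstName": "Joe", "lastName": "Jackson", "gender": "male", "age": 28, "number": "7349282382"},
    {"firstName": "James", "lastName": "Smith", "gender": "male", "age": 32, "number": "5678568567"},
    {"firstName": "Emily", "lastName": "Jones", "gender": "female", "age": 24, "number": "456754675"}
  ]
}
'''

data = json_module.loads(raw)
records = data['people']          # a list of dictionaries, one per person
columns = list(records[0].keys())

print(columns)
print(len(records))
# records can be passed straight to pd.DataFrame(records)
```

This skips the intermediate one-column DataFrame and works whenever the rows live under a known top-level key.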


## <font color = 'dodgerblue'>**Example 2 - large file**
Download the dataset from Kaggle using the link below:  
https://www.kaggle.com/rmisra/news-category-dataset  
Kaggle hosts public datasets and requires an account sign-in to download them.  

Upload the dataset to your Google Drive from your local machine.
We need to mount Google Drive to access its files in Colab.

See the instructions here if you want to download data from Kaggle directly in Colab: https://www.kaggle.com/general/74235




### <font color = 'dodgerblue'>**Save file to google drive and check content**

In [None]:
# Look in the contents of the folder where we saved the file
for entries in archive_folder.iterdir():
  if 'json' in entries.name:
    print(entries.name)

News_Category_Dataset_v2.json.zip


### <font color = 'dodgerblue'>**Step3: Check content of zipped/tar folder**


In [None]:
# path for zipfile
file = archive_folder / 'News_Category_Dataset_v2.json.zip'

# use .namelist() to see the content of zipped folder
with zipfile.ZipFile(file, 'r') as f:
  print(f.namelist())

['News_Category_Dataset_v2.json']


### <font color = 'dodgerblue'>**Step 4: unzip/untar files**

In [None]:
%%timeit
# extract the file (a cell magic such as %%timeit must be the first line of the cell)
file = archive_folder / 'News_Category_Dataset_v2.json.zip'
with zipfile.ZipFile(file, 'r') as f:
  f.extract('News_Category_Dataset_v2.json', path = data_folder)


1.39 s ± 332 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
# we can also use the !unzip command
# Syntax: !unzip {file} -d {location}
# if we do not specify the -d option, files will be saved in the current directory
# you can uncomment the lines below if you want to compare zipfile and !unzip

# %%timeit
# file = archive_folder / 'News_Category_Dataset_v2.json.zip'
# !unzip -o {str(file)} -d {str(data_folder)}

### <font color = 'dodgerblue'>**Step 5: Check content of the main file using head command**

In [None]:
# check content of file using head
file = data_folder / 'News_Category_Dataset_v2.json'
!head {file}

{"category": "CRIME", "headline": "There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV", "authors": "Melissa Jeltsen", "link": "https://www.huffingtonpost.com/entry/texas-amanda-painter-mass-shooting_us_5b081ab4e4b0802d69caad89", "short_description": "She left her husband. He killed their children. Just another day in America.", "date": "2018-05-26"}
{"category": "ENTERTAINMENT", "headline": "Will Smith Joins Diplo And Nicky Jam For The 2018 World Cup's Official Song", "authors": "Andy McDonald", "link": "https://www.huffingtonpost.com/entry/will-smith-joins-diplo-and-nicky-jam-for-the-official-2018-world-cup-song_us_5b09726fe4b0fdb2aa541201", "short_description": "Of course it has a song.", "date": "2018-05-26"}
{"category": "ENTERTAINMENT", "headline": "Hugh Grant Marries For The First Time At Age 57", "authors": "Ron Dicker", "link": "https://www.huffingtonpost.com/entry/hugh-grant-marries_us_5b09212ce4b0568a880b9a8c", "short_description": "The actor and his longtime gi

### <font color = 'dodgerblue'>**Step6: Create pandas dataframe**

In [None]:
# specify file
file  = data_folder / 'News_Category_Dataset_v2.json'
# lines=True tells read_json that the file holds one JSON object per line (JSON Lines format)
large_json = pd.read_json(file, lines=True)
large_json.head()

Unnamed: 0,category,headline,authors,link,short_description,date
0,CRIME,There Were 2 Mass Shootings In Texas Last Week...,Melissa Jeltsen,https://www.huffingtonpost.com/entry/texas-ama...,She left her husband. He killed their children...,2018-05-26
1,ENTERTAINMENT,Will Smith Joins Diplo And Nicky Jam For The 2...,Andy McDonald,https://www.huffingtonpost.com/entry/will-smit...,Of course it has a song.,2018-05-26
2,ENTERTAINMENT,Hugh Grant Marries For The First Time At Age 57,Ron Dicker,https://www.huffingtonpost.com/entry/hugh-gran...,The actor and his longtime girlfriend Anna Ebe...,2018-05-26
3,ENTERTAINMENT,Jim Carrey Blasts 'Castrato' Adam Schiff And D...,Ron Dicker,https://www.huffingtonpost.com/entry/jim-carre...,The actor gives Dems an ass-kicking for not fi...,2018-05-26
4,ENTERTAINMENT,Julianna Margulies Uses Donald Trump Poop Bags...,Ron Dicker,https://www.huffingtonpost.com/entry/julianna-...,"The ""Dietland"" actress said using the bags is ...",2018-05-26
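
With `lines=True`, pandas parses each line of the file as its own JSON object. The sketch below shows the same idea with the standard `json` module on two made-up records, which makes clear why parsing the whole file as a single JSON document fails for this format.

```python
import json as json_module  # aliased because the notebook uses `json` as a variable name

# Two made-up records in JSON Lines format: one complete JSON object per line.
raw = (
    '{"category": "CRIME", "headline": "Example headline A", "date": "2018-05-26"}\n'
    '{"category": "ENTERTAINMENT", "headline": "Example headline B", "date": "2018-05-26"}\n'
)

# Parsing the whole text at once fails because it holds two top-level objects.
try:
    json_module.loads(raw)
    parsed_whole = True
except json_module.JSONDecodeError:
    parsed_whole = False

# Parsing line by line works and yields one dictionary per row.
records = [json_module.loads(line) for line in raw.splitlines() if line.strip()]

print(parsed_whole)
print([r['category'] for r in records])
```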


# <font color = 'dodgerblue'>**Task 4: Get Data from multiple text files**
Often a dataset is stored across multiple files. For example, each review may be a separate file.
<br> We can download the data from here: http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz

Website: http://www.cs.cornell.edu/people/pabo/movie-review-data/ <br>
Also available on: https://github.com/jbrownlee/Datasets

Write a function to process this data <br>
Output : Pandas Data Frame


## <font color = 'dodgerblue'>**Step1: use wget to download data files from URL**

In [None]:
# Download a file to the filesystem from a URL using the wget command
# syntax: !wget {url} -P {path_to_save_file}
# if we do not specify -P, files will be saved in the current directory
file = archive_folder/'review_polarity.tar.gz'
url = 'http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz'
!wget {url} -P {archive_folder} -O {file}

--2022-08-28 11:36:03--  http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.36
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.36|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3127238 (3.0M) [application/x-gzip]
Saving to: ‘/content/drive/MyDrive/datasets/archive/review_polarity.tar.gz’


2022-08-28 11:36:04 (20.1 MB/s) - ‘/content/drive/MyDrive/datasets/archive/review_polarity.tar.gz’ saved [3127238/3127238]



## <font color = 'dodgerblue'>**Step2: check content of folder where data was downloaded**

In [None]:
# list files of google drive where data was downloaded
for entries in archive_folder.iterdir():
  if 'tar' in entries.name:
    print(entries.name)

review_polarity.tar.gz
scale_whole_review.tar.gz
20news-bydate.tar.gz


Since this is a tar file, we will use tarfile module to read the file
```python
with  tarfile.open(tar_file, 'r') as tar:
  print(tar.getnames())
```
Here `tar_file` is the compressed file. We first open this file in read mode and then use `tar.getnames()` to list the contents of the archive.
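
A self-contained sketch of the same listing step, using a tiny made-up `.tar.gz` built in a temporary folder so it runs anywhere. Opening with mode `'r'` lets `tarfile` detect the gzip compression automatically.

```python
import tarfile
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())

# Build a small folder tree that mimics pos/neg review folders (made-up content).
for label in ('pos', 'neg'):
    folder = tmp / 'txt_sentoken' / label
    folder.mkdir(parents=True)
    (folder / 'cv000.txt').write_text(f'a {label} review\n', encoding='utf-8')

# Pack the tree into a gzip-compressed tar archive.
tar_file = tmp / 'example.tar.gz'
with tarfile.open(tar_file, 'w:gz') as tar:
    tar.add(str(tmp / 'txt_sentoken'), arcname='txt_sentoken')

# List the archive members (folders and files).
with tarfile.open(tar_file, 'r') as tar:
    tar_names = sorted(tar.getnames())

print(tar_names)
```

Note that `getnames()` lists directories as well as files, so the count of names is larger than the count of reviews.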

## <font color = 'dodgerblue'>**Step3: Check content of zipped/tar folder**

In [None]:
# check the content of the tar file
file = archive_folder / 'review_polarity.tar.gz'
with  tarfile.open(file, 'r') as tar:
  tar_names = tar.getnames()


In [None]:
tar_names[0:10]

## <font color = 'dodgerblue'>**Step 4: unzip/untar files**

We will now open the tar file in read mode and then use `tar.extractall` to extract all the files
```python
with  tarfile.open(file, 'r') as tar:
  tar.extractall(path = folder)
```
path is the location where we want to save the extracted files.
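
A sketch of the extraction step on a made-up archive. Recent Python versions also accept a `filter` argument to `extractall` that restricts what an untrusted archive may write; it is left out here because its availability depends on the Python version, so treat it as something to check for your environment.

```python
import tarfile
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())

# Create a made-up source file and pack it into a .tar.gz archive.
source = tmp / 'source' / 'pos'
source.mkdir(parents=True)
(source / 'cv000.txt').write_text('a positive review\n', encoding='utf-8')
tar_file = tmp / 'example.tar.gz'
with tarfile.open(tar_file, 'w:gz') as tar:
    tar.add(str(tmp / 'source'), arcname='txt_sentoken')

# Extract everything into a separate data folder.
data_folder = tmp / 'datasets'
with tarfile.open(tar_file, 'r') as tar:
    tar.extractall(path=data_folder)

# Collect the extracted .txt files relative to the data folder.
extracted = sorted(p.relative_to(data_folder).as_posix() for p in data_folder.rglob('*.txt'))
print(extracted)
```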

In [None]:
# Let us extract the files from the tar file using the command below
# the command below may take up to 10-15 minutes

file = archive_folder / 'review_polarity.tar.gz'
with  tarfile.open(file, 'r') as tar:
  tar.extractall(path = data_folder)

In [None]:
# we can also use the !tar command as below
# We use tar xzf to extract the contents of the compressed tar file
# -x = extract
# -z = gzipped archive
# -f = read the archive from a file
# syntax: tar xzf {file} -C {location_to_untar_files}
# you can uncomment the lines below if you want to compare tarfile and !tar

# %%timeit
# file = archive_folder / 'review_polarity.tar.gz'
# !tar xzf {str(file)} -C {str(data_folder)}

## <font color = 'dodgerblue'>**Step 4a: Understand the structure of unzipped folder/files**

In [None]:
# check the content of the folder where files were extracted
for entries in data_folder.iterdir():
  if 'txt' in entries.name:
    print(entries.name)

labels.txt
reviews.txt
wiki.10K.txt
big.txt
txt_sentoken


In [None]:
# Look in the new directory named 'txt_sentoken'
for entries in (data_folder/'txt_sentoken').iterdir():
  print(entries.name)

neg
pos


In [None]:
# Look in the subdirectory named 'pos'
i = 0
for entries in (data_folder/'txt_sentoken'/'pos').iterdir():
  print(entries.name)
  i+= 1
  if i >10:
    break

cv000_29590.txt
cv001_18431.txt
cv002_15918.txt
cv003_11664.txt
cv004_11636.txt
cv005_29443.txt
cv006_15448.txt
cv007_4968.txt
cv008_29435.txt
cv009_29592.txt
cv010_29198.txt


In [None]:
# let us use the head command to get the first five lines of a file
!head {data_folder/'txt_sentoken'/'pos'/'cv000_29590.txt'} -n 5

films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . 
for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . 
to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . 
the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . 
in other words , don't dismiss this film because of its source . 


## <font color = 'dodgerblue'>**Step 6a: combine all text files**

Let us extract all the text files into a list of strings. We can do this using a for loop to iterate through all the files in a particular directory. There are two sub directories in the main directory, pos and neg. Each sub directory has multiple text files with extension .txt. Each .txt file has a single review.

We will add these texts into different Python list variables as `positive_reviews` and `negative_reviews`.

In [None]:
def combine_reviews(path):
  reviews = []

  # Read every text file in the directory and collect its content in the reviews list.
  for file in path.iterdir():
    # check if the file is a text file
    if file.suffix == '.txt':
      # entries returned by iterdir() already include the directory,
      # so we open `file` directly in read-only mode
      with open(file, 'r') as f:
        # read the whole review as a single string
        text = f.read()
        # append the review to the list
        reviews.append(text)
  return reviews

In [None]:
# Load positive and negative reviews
path_pos = data_folder/'txt_sentoken'/'pos'
path_neg = data_folder/'txt_sentoken'/'neg'
positive_reviews = combine_reviews(path_pos)
negative_reviews = combine_reviews(path_neg)

In [None]:
'1'*10

'1111111111'

## <font color = 'dodgerblue'>**Step6: Create pandas dataframe**

In [None]:
# We will now create a pandas DataFrame
# We combine both lists in order and store them in the column 'Reviews'
# The next column ('Labels') will contain the label matching each review
# In the Reviews column, we have positive reviews followed by negative reviews
# So to create the Labels column, we first build a string that repeats '1'
# as many times as the length of positive_reviews
# Similarly, we build a second string that repeats '0' for the negative reviews
# Finally, we concatenate the two strings and use list() to split the result into one label per review

review_polarity = pd.DataFrame({'Reviews':positive_reviews + negative_reviews,
                                'Labels':list('1' * len(positive_reviews) + '0' * len(negative_reviews))})
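
An alternative way to build the labels is to create them as integers from the start, which avoids a later type conversion. The sketch below uses made-up review files in temporary `pos` and `neg` folders and the hypothetical helper `read_reviews`; sorting the file names keeps the row order reproducible, since `iterdir()` does not guarantee any order.

```python
import tempfile
from pathlib import Path

def read_reviews(folder):
    """Return the text of every .txt file in folder, in sorted file-name order."""
    return [p.read_text(encoding='utf-8') for p in sorted(folder.glob('*.txt'))]

# Made-up reviews standing in for txt_sentoken/pos and txt_sentoken/neg.
base = Path(tempfile.mkdtemp()) / 'txt_sentoken'
samples = {'pos': ['great film', 'loved it'], 'neg': ['dull plot']}
for label, texts in samples.items():
    (base / label).mkdir(parents=True)
    for i, text in enumerate(texts):
        (base / label / f'cv{i:03d}.txt').write_text(text, encoding='utf-8')

positive_reviews = read_reviews(base / 'pos')
negative_reviews = read_reviews(base / 'neg')

# Integer labels: 1 for each positive review, then 0 for each negative review.
reviews = positive_reviews + negative_reviews
labels = [1] * len(positive_reviews) + [0] * len(negative_reviews)

print(list(zip(labels, reviews)))
# These lists can be passed to pd.DataFrame({'Reviews': reviews, 'Labels': labels})
```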

In [None]:
# check the info of the dataset
review_polarity.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Reviews  2000 non-null   object
 1   Labels   2000 non-null   object
dtypes: object(2)
memory usage: 31.4+ KB


In [None]:
# We want the labels in review_polarity to be of type int32.
# astype returns a new DataFrame, so we assign the result back.
review_polarity = review_polarity.astype({'Labels':'int32'})
review_polarity.dtypes

Reviews    object
Labels      int32
dtype: object

In [None]:
# Let us look at our DataFrame
review_polarity

Unnamed: 0,Reviews,Labels
0,films adapted from comic books have had plenty...,1
1,every now and then a movie comes along from a ...,1
2,you've got mail works alot better than it dese...,1
3,""" jaws "" is a rare film that grabs your atten...",1
4,moviemaking is a lot like being the general ma...,1
...,...,...
1995,"john boorman's "" zardoz "" is a goofy cinematic...",0
1996,the kids in the hall are an acquired taste . \...,0
1997,there was a time when john carpenter was a gre...,0
1998,two party guys bob their heads to haddaway's d...,0


# <font color = 'dodgerblue'>**Task 5: Get Data from multiple CSV files**
In this task, we will learn how to extract multiple CSV files at once and organize them in a single dataset.  

Let us download our dataset from this source  
http://kavita-ganesan.com/entity-ranking-data/#.YNJs5nVKj_f

The dataset contains a zip file with two folders that each contain multiple csv files. 

The exact URL for the dataset: https://github.com/kavgan/OpinRank/archive/refs/heads/master.zip

## <font color = 'dodgerblue'>**Step1: use wget to download data files from URL**

In [None]:
# Download the zip file to our filesystem from a URL using the wget command
# syntax:
# !wget {url} -P {path_to_save_file}
# if we do not specify -P, files will be saved in the current directory
file = archive_folder/'master.zip'
url = 'https://github.com/kavgan/OpinRank/archive/refs/heads/master.zip'
!wget {url}  -P {archive_folder} -O {file}

--2022-08-28 11:48:54--  https://github.com/kavgan/OpinRank/archive/refs/heads/master.zip
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/kavgan/OpinRank/zip/refs/heads/master [following]
--2022-08-28 11:48:54--  https://codeload.github.com/kavgan/OpinRank/zip/refs/heads/master
Resolving codeload.github.com (codeload.github.com)... 140.82.114.9
Connecting to codeload.github.com (codeload.github.com)|140.82.114.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘/content/drive/MyDrive/datasets/archive/master.zip’

master.zip              [         <=>        ]  98.78M  59.5MB/s    in 1.7s    

2022-08-28 11:49:00 (59.5 MB/s) - ‘/content/drive/MyDrive/datasets/archive/master.zip’ saved [103574688]



## <font color = 'dodgerblue'>**Step2: check content of folder where data was downloaded**

In [None]:
# list files of directory
for entries in archive_folder.iterdir():
  if 'zip' in entries.name:
    print(entries.name)

ml-latest-small.zip
COVID-19.zip
tweeteval-main.zip
Reviews.csv.zip
Malware_df.csv.zip
News_Category_Dataset_v2.json.zip
mashqa_data.zip
bike-sharing-demand.zip
OpinRank-master.zip
trainingandtestdata.zip
master.zip
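
The check `'zip' in entries.name` matches any name that contains the letters "zip", not only zip archives. A small sketch of the difference, using a temporary folder with made-up file names (all names here are illustrative):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # made-up file names, only for illustration
    for name in ['master.zip', 'zipcodes.csv', 'Reviews.csv.zip', 'notes.txt']:
        (folder / name).touch()

    # substring test: also picks up zipcodes.csv
    by_substring = sorted(p.name for p in folder.iterdir() if 'zip' in p.name)
    # glob pattern: only names that end in .zip
    by_suffix = sorted(p.name for p in folder.glob('*.zip'))

print(by_substring)
print(by_suffix)
```

For the archive folder used above both approaches give the same list, so this only matters when other file names happen to contain "zip".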


## <font color = 'dodgerblue'>**Step3: Check content of zipped/tar folder**

In [None]:
# check content of zip file 
file = archive_folder / 'master.zip'
with zipfile.ZipFile(file, 'r') as f:
  print(f.namelist())

['OpinRank-master/', 'OpinRank-master/OpinRankDatasetWithJudgments.zip', 'OpinRank-master/README.md']
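
Besides `namelist()`, a `ZipFile` object also offers `infolist()`, which returns one `ZipInfo` object per entry with its name, uncompressed size, and whether it is a folder. The sketch below builds a tiny zip file in a temporary folder (the entry names are made up) and prints that information. This can help estimate how much space the extraction will need.

```python
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / 'demo.zip'
    # build a small synthetic archive to inspect
    with zipfile.ZipFile(demo_zip, 'w') as zf:
        zf.writestr('demo-master/README.md', 'hello')
        zf.writestr('demo-master/data/2007.csv', 'docid,year\n2007_acura_mdx,2007\n')

    with zipfile.ZipFile(demo_zip, 'r') as zf:
        summary = [(info.filename, info.file_size, info.is_dir()) for info in zf.infolist()]

for name, size, is_dir in summary:
    print(f'{size:>8} bytes  {name}')
```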


## <font color = 'dodgerblue'>**Step 4: unzip/untar files**

In [None]:
# unzip file master.zip
file = archive_folder / 'master.zip'
with zipfile.ZipFile(file, 'r') as f:
  f.extractall(path = data_folder)

In [None]:
for entries in data_folder.iterdir():
  if 'master' in entries.name:
    print(entries.name)

OpinRank-master


In [None]:
for entries in (data_folder/'OpinRank-master').iterdir():
  print(entries.name)

OpinRankDatasetWithJudgments.zip
README.md


In [None]:
# check content of the nested zip file
# master.zip was extracted into data_folder, so the nested zip file is located there
file = data_folder / 'OpinRank-master' / 'OpinRankDatasetWithJudgments.zip'
with zipfile.ZipFile(file, 'r') as f:
  print(f.namelist())


['OpinRankDatasetWithJudgments.pdf', 'cars/data/', 'cars/data/2007.csv', 'cars/data/2007/', 'cars/data/2007/2007_acura_mdx', 'cars/data/2007/2007_acura_rdx', 'cars/data/2007/2007_acura_rl', 'cars/data/2007/2007_acura_tl', 'cars/data/2007/2007_acura_tsx', 'cars/data/2007/2007_audi_a3', 'cars/data/2007/2007_audi_a4', 'cars/data/2007/2007_audi_a6', 'cars/data/2007/2007_audi_a8', 'cars/data/2007/2007_audi_q7', 'cars/data/2007/2007_audi_rs4', 'cars/data/2007/2007_audi_s8', 'cars/data/2007/2007_bmw_3_series', 'cars/data/2007/2007_bmw_5_series', 'cars/data/2007/2007_bmw_6_series', 'cars/data/2007/2007_bmw_x3', 'cars/data/2007/2007_bmw_x5', 'cars/data/2007/2007_bmw_z4', 'cars/data/2007/2007_buick_lacrosse', 'cars/data/2007/2007_buick_lucerne', 'cars/data/2007/2007_buick_rainier', 'cars/data/2007/2007_buick_rendezvous', 'cars/data/2007/2007_cadillac_cts', 'cars/data/2007/2007_cadillac_dts', 'cars/data/2007/2007_cadillac_escalade', 'cars/data/2007/2007_cadillac_escalade_esv', 'cars/data/2007/200

In [None]:
# Since the zipped file has multiple files and subfolders, we will extract everything in a sub-folder
# create a new sub-directory
(data_folder/'opinrank').mkdir(exist_ok=True)

In [None]:
# unzip file OpinRank-master/OpinRankDatasetWithJudgments.zip
# this cell can take up to 35 mins to execute

file = data_folder / 'OpinRank-master' / 'OpinRankDatasetWithJudgments.zip'
with zipfile.ZipFile(file, 'r') as f:
  f.extractall(path = data_folder/'opinrank')
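
Extracting the whole archive is slow because it contains many small files. If only some of the files are needed, `extractall` accepts a `members` argument that limits the extraction to the listed entries. The following is a sketch under assumptions: the helper `extract_matching` is hypothetical, and the demo archive is synthetic with made-up entries that only imitate the folder layout.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_matching(zip_path, target_dir, prefix, suffix):
    """Extract only entries whose name starts with prefix and ends with suffix."""
    with zipfile.ZipFile(zip_path, 'r') as zf:
        wanted = [n for n in zf.namelist() if n.startswith(prefix) and n.endswith(suffix)]
        zf.extractall(path=target_dir, members=wanted)
    return wanted


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    demo_zip = tmp_dir / 'demo.zip'
    # synthetic archive; the entries are placeholders, not the real dataset
    with zipfile.ZipFile(demo_zip, 'w') as zf:
        zf.writestr('cars/data/2007.csv', 'docid,year\n')
        zf.writestr('cars/data/2007/2007_acura_mdx', 'review text')
        zf.writestr('hotels/data/example.csv', 'docid\n')

    out_dir = tmp_dir / 'out'
    wanted = extract_matching(demo_zip, out_dir, prefix='cars/data/', suffix='.csv')
    extracted = sorted(p.relative_to(out_dir).as_posix()
                       for p in out_dir.rglob('*') if p.is_file())

print(wanted)
print(extracted)
```

Only the csv file directly under `cars/data/` is written to disk; the other entries stay inside the archive.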

## <font color = 'dodgerblue'>**Step 4a: Understand the structure of unzipped folder/files**

In [None]:
for i in (data_folder/'opinrank').glob('*'):
  print(i.name)

OpinRankDatasetWithJudgments.pdf
cars
OpinRankDatasetWithJudgments.doc
hotels


In [None]:
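import os
# list all sub-directories (at every level) of the opinrank folder
# note: we use the name root instead of dir, because dir is a Python built-in
for root, subdirs, files in os.walk(data_folder/'opinrank'):
  for subdir in subdirs:
    print(os.path.join(root, subdir))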

/content/drive/MyDrive/datasets/data/opinrank/cars
/content/drive/MyDrive/datasets/data/opinrank/hotels
/content/drive/MyDrive/datasets/data/opinrank/cars/data
/content/drive/MyDrive/datasets/data/opinrank/cars/judgments
/content/drive/MyDrive/datasets/data/opinrank/cars/data/2007
/content/drive/MyDrive/datasets/data/opinrank/cars/data/2008
/content/drive/MyDrive/datasets/data/opinrank/cars/data/2009
/content/drive/MyDrive/datasets/data/opinrank/cars/judgments/2007
/content/drive/MyDrive/datasets/data/opinrank/cars/judgments/2008
/content/drive/MyDrive/datasets/data/opinrank/cars/judgments/2009
/content/drive/MyDrive/datasets/data/opinrank/hotels/data
/content/drive/MyDrive/datasets/data/opinrank/hotels/judgments
/content/drive/MyDrive/datasets/data/opinrank/hotels/data/beijing
/content/drive/MyDrive/datasets/data/opinrank/hotels/data/chicago
/content/drive/MyDrive/datasets/data/opinrank/hotels/data/dubai
/content/drive/MyDrive/datasets/data/opinrank/hotels/data/las-vegas
/content/driv
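
The same listing can be produced with pathlib alone, which keeps the code consistent with the rest of the notebook. A minimal sketch: `rglob('*')` visits every entry below a folder, and `is_dir()` keeps only the sub-folders. The folder tree is created in a temporary location and only imitates a small part of the real structure.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / 'opinrank'
    # small synthetic tree that imitates part of the dataset layout
    for sub in ['cars/data/2007', 'cars/judgments/2007', 'hotels/data/beijing']:
        (root / sub).mkdir(parents=True)
    (root / 'cars' / 'data' / '2007.csv').touch()
    (root / 'cars' / 'data' / '2007' / '2007_acura_mdx').touch()

    # all sub-folders at every level, relative to the root folder
    subfolders = sorted(p.relative_to(root).as_posix() for p in root.rglob('*') if p.is_dir())
    # number of files directly inside each sub-folder
    files_per_folder = {
        folder: sum(1 for child in (root / folder).iterdir() if child.is_file())
        for folder in subfolders
    }

for folder in subfolders:
    print(folder, files_per_folder[folder])
```

Counting the files per folder shows quickly where the data actually is.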

In [None]:
# let us see the files in the data folder
for entries in (data_folder/'opinrank'/'cars'/'data').iterdir():
  print(entries.name)

2007.csv
2007
2008.csv
2008
2009.csv
2009


## <font color = 'dodgerblue'>**Step 5: Create pandas dataframe**

In [None]:
# we will combine 2007.csv, 2008.csv and 2009.csv

# Read every csv file in the data directory into its own pandas dataframe
# and store the dataframes in a dictionary.

path = data_folder/'opinrank'/'cars'/'data'

df = {} # dictionary of dataframes
i = 0
# sorted() gives a predictable order; iterdir() yields entries in arbitrary order
for file in sorted(path.iterdir()):
  if file.suffix == '.csv':
    df[i] = pd.read_csv(file)
    i += 1

 

In [None]:
# Concatenate all datasets into a single dataframe
dataset = pd.concat(df, ignore_index=True)
dataset

Unnamed: 0,docid,year,num_reviews,FUEL,INTERIOR,EXTERIOR,BUILD,PERFORMANCE,COMFORT,RELIABILITY,FUN,overall_rating
0,2007_acura_mdx,2007,169,7.43,9.18,8.95,9.11,9.11,9.16,9.29,9.07,8.91
1,2007_acura_rdx,2007,164,6.52,9.30,8.99,9.24,9.24,9.02,9.36,9.38,8.88
2,2007_acura_rl,2007,22,7.77,9.86,9.64,9.86,9.18,9.59,9.64,9.41,9.37
3,2007_acura_tl,2007,151,8.01,9.25,9.42,8.87,9.03,9.15,9.36,9.23,9.04
4,2007_acura_tsx,2007,56,8.25,9.50,9.09,9.50,9.09,9.30,9.66,9.38,9.22
...,...,...,...,...,...,...,...,...,...,...,...,...
593,2009_volkswagen_passat,2009,18,8.11,8.72,9.22,8.72,8.89,8.78,8.33,9.06,8.73
594,2009_volkswagen_rabbit,2009,30,7.23,9.27,8.97,9.10,9.03,8.87,8.73,9.27,8.81
595,2009_volkswagen_routan,2009,47,7.70,9.21,9.32,8.64,8.87,9.62,8.94,9.26,8.94
596,2009_volkswagen_tiguan,2009,88,8.20,9.24,9.48,9.23,9.39,9.33,9.11,9.45,9.18
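
After concatenation it is no longer visible which file a row came from. One possible variation is to add a column with the file name before concatenating. The sketch below uses two tiny synthetic csv files with placeholder values (not the real data); the column name `source_file` is an assumption chosen for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # synthetic files with placeholder values, only for illustration
    (folder / '2007.csv').write_text('docid,year,overall_rating\ndemo_a,2007,1.0\n')
    (folder / '2008.csv').write_text('docid,year,overall_rating\ndemo_b,2008,2.0\n')
    (folder / 'notes.txt').write_text('not a csv file')

    frames = []
    for csv_file in sorted(folder.glob('*.csv')):
        frame = pd.read_csv(csv_file)
        frame['source_file'] = csv_file.name  # remember where each row came from
        frames.append(frame)

# ignore_index=True renumbers the rows 0..n-1 in the combined dataframe
combined = pd.concat(frames, ignore_index=True)
print(combined)
```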


# <font color = 'dodgerblue'>**Summary:**

- Step 1: use !wget to download data files from the URL
- Step 2: check the content of the folder where the data was downloaded (iterate over a pathlib path using iterdir() and use the file.name attribute to check the names of files/folders)
- Step 3: check the content of the zipped/tar folder
 - open the file using ```with zipfile.ZipFile(file, 'r')``` or ```with tarfile.open(file, 'r')```
 - check the content using ``namelist()`` for zip files and ``getnames()`` for tar files
- Step 4: unzip/untar files
 - open zip/tar files
 - use ``extract``/``extractall`` for single/multiple files
- Step 4a: understand the structure of the unzipped folder/files (only required if data is stored in multiple folders and files)
- Step 5: check the content of the main file using the head command
- Step 6a: combine all text files (if data is stored in multiple files)
- Step 6: create a pandas dataframe
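
The tar counterpart of Steps 3 and 4 can be sketched as follows. The example creates a small `.tar.gz` file in a temporary folder (the file names and content are made up), lists its content with `getnames()`, and extracts it with `extractall()`.

```python
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    # create a small folder and pack it into a tar.gz archive
    (tmp_dir / 'source' / 'data').mkdir(parents=True)
    (tmp_dir / 'source' / 'data' / 'a.txt').write_text('first file')
    archive = tmp_dir / 'demo.tar.gz'
    with tarfile.open(archive, 'w:gz') as tar:
        tar.add(tmp_dir / 'source' / 'data', arcname='data')

    # Step 3: check the content; Step 4: extract everything
    with tarfile.open(archive, 'r') as tar:
        names = tar.getnames()
        tar.extractall(path=tmp_dir / 'out')

    restored_text = (tmp_dir / 'out' / 'data' / 'a.txt').read_text()

print(names)
print(restored_text)
```

Mode `'r'` lets `tarfile` detect the compression automatically, so the same code works for plain `.tar` and `.tar.gz` files.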

Note: We can read the content of a text file using

```
with open(path/file,'r') as f:
 text = f.read()
```
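
When the data is spread over many text files, the same idea can be applied in a loop. The following sketch reads every file in a folder into a list of records that could later be passed to `pd.DataFrame`. The helper name `read_text_files`, the keys `docid` and `text`, and the UTF-8 encoding are assumptions made for illustration; the demo files are synthetic.

```python
import tempfile
from pathlib import Path


def read_text_files(folder, encoding='utf-8'):
    """Return one record per file in folder: the file name and its full text."""
    records = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            # errors='replace' avoids a crash if a file is not valid UTF-8
            text = path.read_text(encoding=encoding, errors='replace')
            records.append({'docid': path.name, 'text': text})
    return records


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # synthetic files with placeholder content
    (folder / 'doc_a').write_text('first review', encoding='utf-8')
    (folder / 'doc_b').write_text('second review', encoding='utf-8')
    (folder / 'subfolder').mkdir()  # folders are skipped

    records = read_text_files(folder)

print(records)
```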










# References 

https://realpython.com/working-with-files-in-python/#directory-listing-in-modern-python-versions

https://realpython.com/python-pathlib/#moving-and-deleting-files