<a href="https://colab.research.google.com/github/TolulopeOyejide/PySpark/blob/main/Processing_Unstructured_Data_With_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# STRUCTURED DATA

STRUCTURED DATA is data that is already arranged in rows and columns, or that can be
converted to rows and columns so that it fits neatly into a database table.

Examples are CSV, TXT, XLS files.


*   These files use a delimiter (for example, a comma or a tab) to separate values.
*   Missing values appear as blanks between the delimiters.
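
The two points above can be sketched with the standard library's `csv` module. This is only an illustration: the sample text and the choice to map blank fields to `None` are assumptions, not part of the original notebook.

```python
import csv
import io

# Hypothetical sample: comma-delimited records; the salary of the second row is blank
raw = "id,name,salary\n1,Rick,623.30\n2,Dan,\n3,Tusar,611.00\n"

reader = csv.DictReader(io.StringIO(raw), delimiter=",")
rows = [
    # Treat an empty field between delimiters as a missing value (None)
    {key: (value if value != "" else None) for key, value in record.items()}
    for record in reader
]

for row in rows:
    print(row)
```

Changing `delimiter` to `"\t"` would read tab-separated text the same way.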





# UNSTRUCTURED DATA

Characteristics of unstructured data:
1.   Lines are not fixed width.
2.   Typical examples are HTML pages, images, and PDF files.



In [None]:
# Read a TXT file and print each line with its line number

filename = r'path\input.txt'  # raw string so the backslash is kept literally

with open(filename) as fn:
  # enumerate keeps count of the lines, starting at 1;
  # the loop stays inside the 'with' block so the file is still open
  for lncnt, ln in enumerate(fn, start=1):
    print("Line {}: {}".format(lncnt, ln.strip()))

In [None]:
# Counting word frequency in the file using the 'Counter' class

from collections import Counter

with open(r'pathinput2.txt') as f:
  p = Counter(f.read().split())
  print(p)
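
As a small follow-up sketch, `Counter.most_common` can rank the words once they are counted. The sample sentence below is a hypothetical stand-in for the file contents, and lowercasing before splitting is an added assumption.

```python
from collections import Counter

# Hypothetical sample text standing in for the contents of a file
text = "the cat sat on the mat and the dog sat too"

word_counts = Counter(text.lower().split())

# most_common(n) returns the n most frequent words with their counts
top_two = word_counts.most_common(2)
print(top_two)
```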


**DEALING WITH MISSING VALUES USING PYTHON**

In [None]:
# Missing values are usually identified as 'NA' or 'NaN'
# 'NaN' means Not a Number

In [None]:
# Generating a dataset with missing values
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df)

        one       two     three
a -0.484313 -1.113501 -1.137357
b       NaN       NaN       NaN
c  0.153414  1.802058  0.924224
d       NaN       NaN       NaN
e  1.970253  1.086653  0.789157
f  1.634101  0.133779 -1.688031
g       NaN       NaN       NaN
h -1.673576  1.326032 -0.253502


In [None]:
# The first step for dealing with missing values in a dataset is to check for it.

print(df['one'].isnull()) #check for null values in column 'one'

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool


In [None]:
# Replacing 'NaN' with zeros

print(df.fillna(0))
# NB: We can also fill with any other value



        one       two     three
a -0.484313 -1.113501 -1.137357
b  0.000000  0.000000  0.000000
c  0.153414  1.802058  0.924224
d  0.000000  0.000000  0.000000
e  1.970253  1.086653  0.789157
f  1.634101  0.133779 -1.688031
g  0.000000  0.000000  0.000000
h -1.673576  1.326032 -0.253502


In [None]:
# Dropping Missing Values
print (df.dropna())

        one       two     three
a -0.484313 -1.113501 -1.137357
c  0.153414  1.802058  0.924224
e  1.970253  1.086653  0.789157
f  1.634101  0.133779 -1.688031
h -1.673576  1.326032 -0.253502
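
Filling with zero and dropping whole rows are not the only options. The sketch below, on a small hypothetical frame, fills gaps with each column's mean and drops rows only when one chosen column is missing; whether either is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame with a few missing entries
scores = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [4.0, 5.0, np.nan]},
    index=["a", "b", "c"],
)

# Fill each column's gaps with that column's mean
filled = scores.fillna(scores.mean())

# Drop only the rows where column 'one' is missing
trimmed = scores.dropna(subset=["one"])

print(filled)
print(trimmed)
```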


In [None]:
# Replace missing or generic values
df2 = pd.DataFrame ({'one': [10,20,30,40,50,2000], 'two': [1000,0,30,40,50,60]})
print(df2.replace({1000:10,2000:60}))

   one  two
0   10   10
1   20    0
2   30   30
3   40   40
4   50   50
5   60   60


**PROCESSING CSV DATA**



1.   CSV stands for Comma-Separated Values.
2.   A CSV file is a text file in which the values in each row are separated by commas.
3.   The file can be created with any text editor, such as Windows Notepad.




In [None]:
from google.colab import files
uploaded = files.upload()

Saving input.csv to input.csv


In [None]:
import pandas as pd
data = pd.read_csv('input.csv')

In [None]:
print(data)

   id    name  salary  start_date        dept
0   1    Rick  623.30  2012-01-01          IT
1   2     Dan  515.20  2013-09-23  Operations
2   3   Tusar  611.00  2014-11-15          IT
3   4    Ryan  729.00  2014-05-11          HR
4   5    Gary  843.25  2015-03-27     Finance
5   6   Rasmi  578.00  2013-05-21          IT
6   7  Pranab  632.80  2013-07-30  Operations
7   8    Guru  722.50  2014-06-17     Finance


In [None]:
# Reading the first five rows of the 'salary' column
print(data[0:5]['salary'])

0    623.30
1    515.20
2    611.00
3    729.00
4    843.25
Name: salary, dtype: float64


In [None]:
# Display 'salary' and 'name' columns for all the rows
print(data.loc[:, ['salary', 'name']])  # .loc indexes rows and columns by label in one call

   salary    name
0  623.30    Rick
1  515.20     Dan
2  611.00   Tusar
3  729.00    Ryan
4  843.25    Gary
5  578.00   Rasmi
6  632.80  Pranab
7  722.50    Guru


In [None]:
# Display 'salary' and 'name' columns for rows 1, 3 and 5
print(data.loc[[1, 3, 5], ['salary', 'name']])  # a list of row labels selects exactly those rows

   salary   name
1   515.2    Dan
3   729.0   Ryan
5   578.0  Rasmi


In [None]:
# Display 'salary' and 'name' columns for rows 2 through 6
print(data.loc[2:6, ['salary', 'name']])  # label slices with .loc include both end points

   salary    name
2  611.00   Tusar
3  729.00    Ryan
4  843.25    Gary
5  578.00   Rasmi
6  632.80  Pranab
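
Row selection does not have to go through row labels. As a hedged sketch, the block below reads three rows of the sample table from an in-memory string (an assumption, so that no file is needed), parses the date column, and filters rows with a boolean mask.

```python
import io
import pandas as pd

# First three rows of the sample table, held in memory (assumption: no file on disk)
csv_text = """id,name,salary,start_date,dept
1,Rick,623.30,2012-01-01,IT
2,Dan,515.20,2013-09-23,Operations
3,Tusar,611.00,2014-11-15,IT
"""

staff = pd.read_csv(io.StringIO(csv_text), parse_dates=["start_date"])

# Boolean mask: keep only IT staff, then pick two columns by label
it_staff = staff.loc[staff["dept"] == "IT", ["name", "salary"]]
print(it_staff)
```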


**PROCESSING JSON DATA**


1.   A JSON file stores data as text in a human-readable format.
2.   JSON stands for JavaScript Object Notation.
3.   A JSON file can be created with any text editor, such as Windows Notepad.



In [None]:
from google.colab import files
uploaded = files.upload()

Saving input.json to input.json


In [None]:
import pandas as pd
data2 = pd.read_json('input.json')
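
For a look at what a JSON document contains before handing it to pandas, the standard library's `json` module is enough. The record layout below is a hypothetical example, not the contents of `input.json`.

```python
import json

# Hypothetical JSON document: a list of employee records
json_text = '[{"id": 1, "name": "Rick", "dept": "IT"}, {"id": 2, "name": "Dan", "dept": "Operations"}]'

# Parse the text into Python lists and dicts
records = json.loads(json_text)
names = [record["name"] for record in records]
print(names)

# Serialise one record back to indented, human-readable text
print(json.dumps(records[0], indent=2))
```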


**PROCESSING EXCEL DATA**



In [None]:
from google.colab import files
uploaded = files.upload()

Saving The excel file.xlsx to The excel file.xlsx


In [None]:
import pandas as pd
dt = pd.read_excel('The excel file.xlsx')

In [None]:
# Reading specific columns and rows
print(dt.loc[[1,3,5], ['salary', 'name']])

   salary   name
1   515.2    Dan
3   729.0   Ryan
5   578.0  Rasmi


In [None]:
# Reading multiple excel sheets

with pd.ExcelFile('the file path copied') as xls:
  df1 = pd.read_excel(xls, 'sheet 1')
  df2 = pd.read_excel(xls, 'sheet 2')

**PYTHON RELATIONAL DATABASES**

In [None]:
# Installing conda in Colab with condacolab

!pip install -q condacolab
import condacolab
condacolab.install()

⏬ Downloading https://github.com/conda-forge/miniforge/releases/download/23.1.0-1/Mambaforge-23.1.0-1-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:15
🔁 Restarting kernel...


In [None]:
import sys
print(sys.executable)

/usr/bin/python3.real


In [None]:
%pip install sqlalchemy

Collecting sqlalchemy
  Downloading SQLAlchemy-2.0.19-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.7 MB)
Collecting typing-extensions>=4.2.0
  Downloading typing_extensions-4.7.1-py3-none-any.whl (33 kB)
Collecting greenlet!=0.4.17
  Downloading greenlet-2.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (613 kB)
Installing collected pac

In [None]:
from sqlalchemy import create_engine

In [None]:
import pandas as pd

In [None]:
dt = pd.read_csv('input.csv')

In [None]:
# Create the database engine
engine = create_engine('sqlite:///:memory:')

# The same approach can also be used to connect to MySQL, Oracle, PostgreSQL, and MSSQL

In [None]:
# Store the dataframe read above as a table named 'dt'
dt.to_sql('dt', engine)

8

In [None]:
# Query 1 on the relational table
def read_table():
  # Return every row of the 'dt' table as a dataframe
  return pd.read_sql_query('SELECT * FROM dt', engine)

res1 = read_table()
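
To show a filtered query without any extra installs, here is a minimal sketch that swaps SQLAlchemy for the standard library's `sqlite3` module. The table layout and the three rows are assumptions borrowed from the sample salary table.

```python
import sqlite3

# In-memory database; the rows are taken from the sample salary table
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE dt (name TEXT, salary REAL, dept TEXT)")
connection.executemany(
    "INSERT INTO dt VALUES (?, ?, ?)",
    [("Rick", 623.30, "IT"), ("Dan", 515.20, "Operations"), ("Tusar", 611.00, "IT")],
)

# Parameterised query: the ? placeholder keeps the value out of the SQL string
it_rows = connection.execute(
    "SELECT name, salary FROM dt WHERE dept = ? ORDER BY salary DESC", ("IT",)
).fetchall()
print(it_rows)
connection.close()
```

Passing values through placeholders rather than string formatting is the usual way to avoid SQL injection.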


**PYTHON NoSQL DATABASE**

In [None]:
%pip install pymongo

Collecting pymongo
  Downloading pymongo-4.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (603 kB)
Collecting dnspython<3.0.0,>=1.16.0
  Downloading dnspython-2.4.1-py3-none-any.whl (300 kB)
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.4.1 pymongo-4.4.1

In [None]:
from pymongo import MongoClient
from pprint import pprint

In [None]:
client = MongoClient()  # defaults to localhost:27017; pass a connection URI to reach another server

**TIME AND DATE**

In [None]:
import datetime

In [None]:
print (datetime.datetime.today())

2023-08-09 02:17:08.451040


In [None]:
date_today = datetime.date.today()  # must be defined before it is used below

print('This Year :', date_today.year)
print('This Month :', date_today.month)
print('Month Name:', date_today.strftime('%B'))
print('Day of Month :', date_today.day)
print('Week Day Name:', date_today.strftime('%A'))

**Date Time Arithmetic**

In [None]:
import datetime

In [None]:
# Capture the first Date
day1 = datetime.date(2018, 2, 12)
print ('day1:', day1.ctime())

day1: Mon Feb 12 00:00:00 2018


In [None]:
# Capture the Second Date
day2 = datetime.date(2018, 2, 12)
print ('day2:', day2.ctime())

day2: Mon Feb 12 00:00:00 2018


In [None]:
# Finding the difference between the dates
print ('Number of Days:',day1 -day2)

Number of Days: 0:00:00


In [None]:
date_today = datetime.date.today()

In [None]:
# Create a delta of four days
no_of_days = datetime.timedelta(days=4)

In [None]:
# Use Delta For Past Date
before_four_days = date_today - no_of_days
print ('Before Four Days:', before_four_days)

Before Four Days: 2023-08-05


In [None]:
# Use Delta For Future Date
after_four_days = date_today + no_of_days
print ('After Four Days:', after_four_days)

After Four Days: 2023-08-13
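
Because `day1` and `day2` above are the same date, their difference is zero. As an illustrative sketch with two distinct, arbitrarily chosen dates, the block below reads the gap in whole days from the resulting `timedelta` and also parses a date from text before shifting it.

```python
import datetime

start = datetime.date(2018, 2, 12)
end = datetime.date(2018, 3, 1)

# Subtracting two dates yields a timedelta; .days holds the whole-day count
gap = end - start
print(gap.days)

# Parse a date from text, then shift it forward by one week
parsed = datetime.datetime.strptime("2018-02-12", "%Y-%m-%d").date()
shifted = parsed + datetime.timedelta(weeks=1)
print(shifted)
```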


**Date Comparison**

In [None]:
if day1 == before_four_days:
  print('Same Dates')

if date_today > day1:
  print('Past Date')  # day1 lies before today

if day1 < after_four_days:
  print('Future Date')

Past Date
Future Date


**DATA WRANGLING WITH PYTHON**: This involves reshaping data through operations such as merging, grouping, and concatenating.



In [None]:
# Creating two dataframes for merging operations

import pandas as pd
left = pd.DataFrame({
         'id':[1,2,3,4,5],
         'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
         'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
         {'id':[1,2,3,4,5],
         'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
         'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print (left)
print (right)

   id    Name subject_id
0   1    Alex       sub1
1   2     Amy       sub2
2   3   Allen       sub4
3   4   Alice       sub6
4   5  Ayoung       sub5
   id   Name subject_id
0   1  Billy       sub2
1   2  Brian       sub4
2   3   Bran       sub3
3   4  Bryce       sub6
4   5  Betty       sub5
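
The two frames are created above, but no merge is performed. A minimal sketch of one, assuming `subject_id` is the intended join key and using made-up suffixes to tell the overlapping columns apart:

```python
import pandas as pd

left = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Name": ["Alex", "Amy", "Allen", "Alice", "Ayoung"],
    "subject_id": ["sub1", "sub2", "sub4", "sub6", "sub5"]})
right = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "Name": ["Billy", "Brian", "Bran", "Bryce", "Betty"],
    "subject_id": ["sub2", "sub4", "sub3", "sub6", "sub5"]})

# Inner join: keep only subject_id values present in both frames
inner = pd.merge(left, right, on="subject_id", how="inner", suffixes=("_left", "_right"))

# Left join: keep every row of `left`; unmatched rows get NaN on the right side
left_join = pd.merge(left, right, on="subject_id", how="left", suffixes=("_left", "_right"))

print(inner)
print(left_join)
```

`how="outer"` and `how="right"` follow the same pattern for the other join types.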


In [None]:
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}

In [None]:
import pandas as pd
df = pd.DataFrame(ipl_data) # Converting the data to dataframe

In [None]:
# GROUPING A DATASET
grouped = df.groupby('Year')  # group the rows by the 'Year' column
print(grouped.get_group(2014))

# In the sketches below, `one` and `two` stand for any two dataframes.

# CONCATENATING TWO DATASETS:
# concatenated = pd.concat([one, two])

# AGGREGATING A DATAFRAME:
# df.aggregate('sum')  -- aggregate every column of the table
# df['column'].aggregate('sum')  -- aggregate one column of the table
# df[['column A', 'column B']].aggregate('sum')  -- aggregate several columns of the table
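
The grouping, aggregation, and concatenation notes above can be tried on a small frame. This is a sketch only: the shortened team table and the extra `Royals` row are hypothetical.

```python
import pandas as pd

# Shortened, hypothetical version of the team table
teams = pd.DataFrame({
    "Team": ["Riders", "Riders", "Devils", "Kings"],
    "Year": [2014, 2015, 2014, 2014],
    "Points": [876, 789, 863, 741]})

# Group by year and total the points within each group
points_per_year = teams.groupby("Year")["Points"].sum()
print(points_per_year)

# Several aggregates at once for a single column
print(teams["Points"].agg(["sum", "mean"]))

# Stack two frames on top of each other, renumbering the rows
extra = pd.DataFrame({"Team": ["Royals"], "Year": [2015], "Points": [804]})
combined = pd.concat([teams, extra], ignore_index=True)
print(combined)
```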

**READING HTML PAGES**

In [None]:
# The package is published on PyPI as 'beautifulsoup4'; 'BeautifulSoup' is the old Python 2 release
%pip install beautifulsoup4


In [None]:
import urllib.request  # urllib2 exists only in Python 2
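
As a rough sketch of what reading an HTML page involves, the block below uses the standard library's `html.parser` in place of BeautifulSoup, which offers a higher-level interface for the same task. The `LinkCollector` class and the in-memory page are hypothetical; in practice the HTML text would come from a network request.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


# Hypothetical page held in memory instead of fetched over the network
page = '<html><body><a href="/docs">Docs</a><p>Text</p><a href="/blog">Blog</a></body></html>'

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```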

**WORD TOKENIZATION**

*   It is all about splitting a large sample of text into words.



In [None]:
%pip install nltk


In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"

In [None]:
nltk_tokens = nltk.word_tokenize(word_data)


In [None]:
print (nltk_tokens)

['It', 'originated', 'from', 'the', 'idea', 'that', 'there', 'are', 'readers', 'who', 'prefer', 'learning', 'new', 'skills', 'from', 'the', 'comforts', 'of', 'their', 'drawing', 'rooms']


**Tokenizing Sentences**

In [None]:
sentence_data = "Sun rises in the east. Sun sets in the west."

nltk_tokens = nltk.sent_tokenize(sentence_data)

In [None]:
print(nltk_tokens)

['Sun rises in the east.', 'Sun sets in the west.']
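
To see roughly what the tokenizers do, here is a simplified stand-in built on regular expressions. It is not how NLTK works internally and will mishandle abbreviations, contractions, and similar cases; both patterns are assumptions chosen for illustration.

```python
import re

text = "Sun rises in the east. Sun sets in the west."

# Words are runs of letters or digits; each punctuation mark becomes its own token
words = re.findall(r"\w+|[^\w\s]", text)

# Split after '.', '!' or '?' when followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", text)

print(words)
print(sentences)
```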
