# Module 7: Data Wrangling with Pandas

**CPE 311 Computational Thinking with Python**

Submitted by: Sumilang, Kenneth Ian G.

Performed on: 7/4/24

Submitted on: 7/4/24

Submitted to: Engr. Roman M. Richard

## 7.1 Supplementary Activity 

Using the datasets provided, perform the following exercises:

### Exercise 1



We want to look at data for the Facebook, Apple, Amazon, Netflix, and Google (FAANG) stocks, but we were given each as a separate CSV file. Combine them into a single file and store the dataframe of the FAANG data as faang for the rest of the exercises:

1. Read each file in.
2. Add a column to each dataframe, called ticker, indicating the ticker symbol it is for (Apple's is AAPL, for example). This is how you look up a stock. Each file's name is also the ticker symbol, so be sure to capitalize it.
3. Append them together into a single dataframe.
4. Save the result in a CSV file called faang.csv.

In [2]:
import pandas as pd

# File names
apple = 'aapl.csv'
amazon = 'amzn.csv'
facebook = 'fb.csv'
google = 'goog.csv'
netflix = 'nflx.csv'

# Read each file into a DataFrame and add the ticker column
df_apple = pd.read_csv(apple)
df_apple['ticker'] = 'AAPL'

df_amazon = pd.read_csv(amazon)
df_amazon['ticker'] = 'AMZN'

df_facebook = pd.read_csv(facebook)
df_facebook['ticker'] = 'FB'

df_google = pd.read_csv(google)
df_google['ticker'] = 'GOOG'

df_netflix = pd.read_csv(netflix)
df_netflix['ticker'] = 'NFLX'

# Append them together into a single DataFrame
faang = pd.concat([df_apple, df_amazon, df_facebook, df_google, df_netflix], ignore_index=True)

# Save the result in a CSV file called faang.csv
faang.to_csv('faang.csv', index=False)

print("FAANG data has been combined and saved to 'faang.csv'.")


FAANG data has been combined and saved to 'faang.csv'.
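
The five near-identical read-and-tag steps above could also be written as a loop. The following is a minimal sketch rather than the submitted solution: the helper name `combine_tickers` and the two sample files written to a temporary directory are hypothetical, and it assumes each file's stem is the ticker symbol, as the exercise states.

```python
import tempfile
from pathlib import Path

import pandas as pd


def combine_tickers(paths):
    """Read each CSV, tag it with the upper-cased file stem, and stack the results."""
    frames = []
    for path in paths:
        path = Path(path)
        frame = pd.read_csv(path)
        frame["ticker"] = path.stem.upper()  # 'aapl.csv' -> 'AAPL'
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Hypothetical two-row sample files, only so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    sample_paths = []
    for stem in ("aapl", "fb"):
        path = Path(tmp) / f"{stem}.csv"
        path.write_text("date,open,volume\n2018-01-02,1.0,100\n2018-01-03,2.0,200\n")
        sample_paths.append(path)
    combined = combine_tickers(sample_paths)

print(combined["ticker"].unique())
```

Deriving the ticker from the file name keeps the label and the file in sync, at the cost of relying on the naming convention.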


### Exercise 2

* With faang, use type conversion to change the date column into a datetime and the volume column into integers. Then, sort by date and ticker.
* Find the seven rows with the highest value for volume.
* Right now, the data is somewhere between long and wide format. Use melt() to make it completely long format. Hint: date and ticker are our ID variables (they uniquely identify each row). We need to melt the rest so that we don't have separate columns for open, high, low, close, and volume.


In [3]:
import pandas as pd

# load the FAANG dataset

FAANG = pd.read_csv('faang.csv')

# Convert the 'date' column to 'datetime'

FAANG['date'] = pd.to_datetime(FAANG['date'])

# Convert the 'volume' column to integers.

FAANG['volume'] = FAANG['volume'].astype(int)

# Sort by 'date' and 'ticker'

FAANG = FAANG.sort_values(by=['date','ticker'])

# Find the seven rows with the highest value for 'volume'

top7 = FAANG.nlargest(7, 'volume')

top7

Unnamed: 0,date,open,high,low,close,volume,ticker
644,2018-07-26,174.89,180.13,173.75,176.26,169803668,FB
555,2018-03-20,167.47,170.2,161.95,168.15,129851768,FB
559,2018-03-26,160.82,161.1,149.02,160.06,126116634,FB
556,2018-03-21,164.8,173.4,163.3,169.39,106598834,FB
182,2018-09-21,219.0727,219.6482,215.6097,215.9768,96246748,AAPL
245,2018-12-21,156.1901,157.4845,148.9909,150.0862,95744384,AAPL
212,2018-11-02,207.9295,211.9978,203.8414,205.8755,91328654,AAPL
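
As a cross-check on `nlargest`, the same rows can be obtained by sorting on `volume` in descending order and taking the head. The sketch below uses a small made-up frame (the values are illustrative, not taken from faang.csv) and also shows `errors='coerce'`, which turns unparseable dates into `NaT` instead of raising.

```python
import pandas as pd

# Made-up sample; the real data comes from faang.csv.
sample = pd.DataFrame({
    "date": ["2018-01-02", "2018-01-03", "not a date", "2018-01-05"],
    "ticker": ["FB", "AAPL", "FB", "AAPL"],
    "volume": [300.0, 100.0, 400.0, 200.0],
})

# Unparseable strings become NaT rather than raising an error.
sample["date"] = pd.to_datetime(sample["date"], errors="coerce")
sample["volume"] = sample["volume"].astype("int64")

# Two ways to get the rows with the highest volume.
top2 = sample.nlargest(2, "volume")
top2_sorted = sample.sort_values("volume", ascending=False).head(2)

print(top2)
print(top2.equals(top2_sorted))
```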


#### Melting the DataFrame

In [5]:
F_LONG = pd.melt(FAANG, id_vars=['date','ticker'],value_vars=['open','high','low','close','volume'])

# Display the melted DataFrame
F_LONG

Unnamed: 0,date,ticker,variable,value
0,2018-01-02,AAPL,open,1.669271e+02
1,2018-01-02,AMZN,open,1.172000e+03
2,2018-01-02,FB,open,1.776800e+02
3,2018-01-02,GOOG,open,1.048340e+03
4,2018-01-02,NFLX,open,1.961000e+02
...,...,...,...,...
6270,2018-12-31,AAPL,volume,3.500347e+07
6271,2018-12-31,AMZN,volume,6.954507e+06
6272,2018-12-31,FB,volume,2.462531e+07
6273,2018-12-31,GOOG,volume,1.493722e+06
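
To see what `melt()` does on a frame small enough to read in full, here is a sketch with two made-up rows. It also pivots the long frame back to wide as a sanity check that `date` and `ticker` identify each row. Melting `volume` together with the price columns puts integers and floats into one `value` column, which is why the volumes above print in scientific notation.

```python
import pandas as pd

# Made-up values for illustration only.
wide = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-02", "2018-01-02"]),
    "ticker": ["AAPL", "FB"],
    "open": [166.9, 177.7],
    "close": [168.3, 181.4],
    "volume": [25_000_000, 18_000_000],
})

long_df = wide.melt(id_vars=["date", "ticker"],
                    value_vars=["open", "close", "volume"])

# One row per (date, ticker, variable): 2 rows x 3 measured columns = 6 rows.
print(long_df)

# Pivoting back recovers one column per variable.
back = long_df.pivot(index=["date", "ticker"], columns="variable", values="value")
print(back)
```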


### Exercise 3


* Using web scraping, search for a list of hospitals with their addresses and contact information. Save the list in a new CSV file, hospitals.csv.
* Using the generated hospitals.csv, load the CSV file into a pandas DataFrame. Prepare the data using the necessary preprocessing techniques.

In [19]:
# 1. Retrieve PDF
import requests

url = 'https://www.philhealth.gov.ph/partners/providers/institutional/accredited/HOSP_073123.pdf'

req = requests.get(url)
req.raise_for_status()  # stop here if the download failed

with open('hospitals.pdf','wb') as f:
    f.write(req.content)

In [11]:
# install library that can extract stuff from a pdf first
!pip install pdfplumber 



In [91]:
#2. Extract Data from the pdf
import pdfplumber as pdfile
import pandas as pd

with pdfile.open('hospitals.pdf') as pdf:
    # Extract tables from the first 106 pages
    pages = pdf.pages[:106]
    tables = [table for page in pages for table in page.extract_tables()]


In [92]:
tables #display 'tables'

[[['',
   'NAME OF HEALTH FACILITY',
   'BEDS',
   'CAT',
   'TEL_NO',
   'EMAIL',
   'STREET',
   'MUNICIPALITY',
   'SEC'],
  ['CORDILLERA ADMINISTRATIVE REGION',
   None,
   None,
   None,
   None,
   '',
   None,
   None,
   None],
  ['ABRA', None, None, None, None, None, None, None, None],
  ['1',
   'ABRA PROVINCIAL HOSPITAL',
   '50',
   'LEVEL 1',
   '0747527509',
   'abraprovincialhospital@yahoo.c\nom.ph',
   'CAPITULACION STREET, CALABA',
   'BANGUED',
   'G'],
  ['2',
   'ASSUMPTA FAMILY HOSPITAL',
   '17',
   'LEVEL 1',
   '09177997644',
   'assumpta_815@yahoo.com.ph',
   'MAGALLANES ST., ZONE 5',
   'BANGUED',
   'P'],
  ['3',
   'BANGUED CHRISTIAN HOSPITAL',
   '17',
   'LEVEL 1',
   '0747525644',
   'banguedchristianhospital@yaho\no.com.ph',
   'TORRIJOS STREET, ZONE 5',
   'BANGUED',
   'P'],
  ['4',
   'DR. PETRONILO V. SEARES SR.\nMEMORIAL HOSPITAL',
   '24',
   'LEVEL 1',
   '09353388157 /\n09677608595',
   'petronilo_seares@yahoo.com.p\nh',
   'PEÑARRUBIA ST., ZONE 

In [95]:
# I think we can turn this into a dataframe

if tables:
    df = pd.DataFrame(tables[0])  # only the first extracted table is used
else:
    raise ValueError("No tables in the PDF")  # df would be undefined below

#display
df.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,,NAME OF HEALTH FACILITY,BEDS,CAT,TEL_NO,EMAIL,STREET,MUNICIPALITY,SEC
1,CORDILLERA ADMINISTRATIVE REGION,,,,,,,,
2,ABRA,,,,,,,,
3,1,ABRA PROVINCIAL HOSPITAL,50,LEVEL 1,0747527509,abraprovincialhospital@yahoo.c\nom.ph,"CAPITULACION STREET, CALABA",BANGUED,G
4,2,ASSUMPTA FAMILY HOSPITAL,17,LEVEL 1,09177997644,assumpta_815@yahoo.com.ph,"MAGALLANES ST., ZONE 5",BANGUED,P
5,3,BANGUED CHRISTIAN HOSPITAL,17,LEVEL 1,0747525644,banguedchristianhospital@yaho\no.com.ph,"TORRIJOS STREET, ZONE 5",BANGUED,P
6,4,DR. PETRONILO V. SEARES SR.\nMEMORIAL HOSPITAL,24,LEVEL 1,09353388157 /\n09677608595,petronilo_seares@yahoo.com.p\nh,"PEÑARRUBIA ST., ZONE 4",BANGUED,P
7,5,LA PAZ DISTRICT HOSPITAL,12,INF/DISP,09273096506,lpdhpho.pga@gmail.com,POBLACION,LA PAZ,G
8,6,ST. THERESA WELLNESS CENTER,12,INF/DISP,09778268084,sttheresawellnesscenter@gmail.\ncom,BARANGAY TALOGTOG,DOLORES,P
9,7,VALERA MEDICAL HOSPITAL,15,INF/DISP,0747255595,valeramed_hospital@yahoo.co\nm,"RIZAL STREET, ZONE 7",BANGUED,P
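
`tables` holds one list of rows per extracted table, and the cell above converts only `tables[0]`. A possible way to use every table is sketched below. It assumes each table starts with the same header row seen in the output above, which may not hold for every page of the PDF. The helper `tables_to_frame` and the toy input are hypothetical.

```python
import pandas as pd


def tables_to_frame(tables):
    """Stack pdfplumber-style tables (lists of rows) under the first table's header."""
    if not tables:
        return pd.DataFrame()
    header = tables[0][0]
    rows = []
    for table in tables:
        for row in table:
            if row == header:  # skip the header repeated at the top of each table
                continue
            rows.append(row)
    return pd.DataFrame(rows, columns=header)


# Toy input shaped like the extract_tables() output shown above.
toy_tables = [
    [["", "NAME", "TEL_NO"], ["ABRA", None, None], ["1", "HOSPITAL A", "0747000000"]],
    [["", "NAME", "TEL_NO"], ["2", "HOSPITAL B", "0917000000"]],
]

hospitals = tables_to_frame(toy_tables)
print(hospitals)
```

Promoting the header row to column names up front would also make later steps readable by name (`'STREET'`, `'TEL_NO'`) instead of by position.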


In [96]:
# I noticed that there are rows that divide the dataframe into regions.
# I don't really need them, so let's use dropna() to drop the rows that are mostly 'None' values

df_drops=df.dropna(thresh=df.shape[1]-2)
df_drops

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,,NAME OF HEALTH FACILITY,BEDS,CAT,TEL_NO,EMAIL,STREET,MUNICIPALITY,SEC
3,1.0,ABRA PROVINCIAL HOSPITAL,50,LEVEL 1,0747527509,abraprovincialhospital@yahoo.c\nom.ph,"CAPITULACION STREET, CALABA",BANGUED,G
4,2.0,ASSUMPTA FAMILY HOSPITAL,17,LEVEL 1,09177997644,assumpta_815@yahoo.com.ph,"MAGALLANES ST., ZONE 5",BANGUED,P
5,3.0,BANGUED CHRISTIAN HOSPITAL,17,LEVEL 1,0747525644,banguedchristianhospital@yaho\no.com.ph,"TORRIJOS STREET, ZONE 5",BANGUED,P
6,4.0,DR. PETRONILO V. SEARES SR.\nMEMORIAL HOSPITAL,24,LEVEL 1,09353388157 /\n09677608595,petronilo_seares@yahoo.com.p\nh,"PEÑARRUBIA ST., ZONE 4",BANGUED,P
7,5.0,LA PAZ DISTRICT HOSPITAL,12,INF/DISP,09273096506,lpdhpho.pga@gmail.com,POBLACION,LA PAZ,G
8,6.0,ST. THERESA WELLNESS CENTER,12,INF/DISP,09778268084,sttheresawellnesscenter@gmail.\ncom,BARANGAY TALOGTOG,DOLORES,P
9,7.0,VALERA MEDICAL HOSPITAL,15,INF/DISP,0747255595,valeramed_hospital@yahoo.co\nm,"RIZAL STREET, ZONE 7",BANGUED,P
10,8.0,VILLAVICIOSA MEDICARE AND\nCOMMUNITY HOSPITAL,12,INF/DISP,09168911087,vmchospital2008@yahoo.com.p\nh,AP-APAYA,VILLAVICIOSA,G
12,9.0,AMMA JADSAC DISTRICT HOSPITAL,18,INF/DISP,09161292808,ajdh.hospital@gmail.com,BARANGAY POBLACION,PUDTOL,G
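
`thresh` is the minimum number of non-missing cells a row needs in order to be kept, so `thresh=df.shape[1]-2` keeps rows with at most two missing values. A small sketch on made-up rows shows why the region and province dividers disappear while the header row survives: an empty string counts as a value, not as missing.

```python
import pandas as pd

# Made-up rows mirroring the layout above: header, region divider, two hospitals.
toy = pd.DataFrame([
    ["", "NAME", "BEDS", "TEL_NO"],
    ["SOME REGION", None, None, None],
    ["1", "HOSPITAL A", "50", "0747000000"],
    ["2", "HOSPITAL B", None, "0917000000"],
])

# 4 columns, so thresh = 2: keep rows with at least 2 non-missing cells.
kept = toy.dropna(thresh=toy.shape[1] - 2)
print(kept)
```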


In [97]:
# Narrow the wide table down to column index 1 (name of hospital),
# the combined columns [6, 7] (location), and column index 4 (contact number)

# Let's get Column 1 first.

name = df_drops[1]
name


0                               NAME OF HEALTH FACILITY
3                              ABRA PROVINCIAL HOSPITAL
4                              ASSUMPTA FAMILY HOSPITAL
5                            BANGUED CHRISTIAN HOSPITAL
6        DR. PETRONILO V. SEARES SR.\nMEMORIAL HOSPITAL
7                              LA PAZ DISTRICT HOSPITAL
8                           ST. THERESA WELLNESS CENTER
9                               VALERA MEDICAL HOSPITAL
10        VILLAVICIOSA MEDICARE AND\nCOMMUNITY HOSPITAL
12                        AMMA JADSAC DISTRICT HOSPITAL
13                             APAYAO DISTRICT HOSPITAL
14                           APAYAO PROVINCIAL HOSPITAL
15                             CONNER DISTRICT HOSPITAL
16    FAR NORTH LUZON GENERAL\nHOSPITAL AND TRAINING...
17                              FLORA DISTRICT HOSPITAL
18                       STA. MARCELA DISTRICT HOSPITAL
20                               ATOK DISTRICT HOSPITAL
21          BAGUIO GENERAL HOSPITAL AND\nMEDICAL

In [98]:
# Next, let's get columns 6 and 7 and combine them
street = df_drops[6]
muni = df_drops[7]
street


0                                 STREET
3            CAPITULACION STREET, CALABA
4                 MAGALLANES ST., ZONE 5
5                TORRIJOS STREET, ZONE 5
6                 PEÑARRUBIA ST., ZONE 4
7                              POBLACION
8                      BARANGAY TALOGTOG
9                   RIZAL STREET, ZONE 7
10                              AP-APAYA
12                    BARANGAY POBLACION
13                 ABASAG ST., POBLACION
14    DUCRAO STREET, BARANGAY\nPOBLACION
15                                RIPANG
16                         BRGY. QUIRINO
17                              BAGUTONG
18                            SAN CARLOS
20                       SAYANGAN, PAOAY
21                   GOVERNOR PACK ROAD,
Name: 6, dtype: object

In [99]:
muni

0          MUNICIPALITY
3               BANGUED
4               BANGUED
5               BANGUED
6               BANGUED
7                LA PAZ
8               DOLORES
9               BANGUED
10         VILLAVICIOSA
12               PUDTOL
13    CALANASAN (BAYAG)
14              KABUGAO
15               CONNER
16                 LUNA
17                FLORA
18         STA. MARCELA
20                 ATOK
21          BAGUIO CITY
Name: 7, dtype: object

In [100]:
# use concatenation
location = street.fillna('') + ' | ' + muni.fillna('')
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
df_drops = df_drops.assign(location=location)

location


0                            STREET | MUNICIPALITY
3            CAPITULACION STREET, CALABA | BANGUED
4                 MAGALLANES ST., ZONE 5 | BANGUED
5                TORRIJOS STREET, ZONE 5 | BANGUED
6                 PEÑARRUBIA ST., ZONE 4 | BANGUED
7                               POBLACION | LA PAZ
8                      BARANGAY TALOGTOG | DOLORES
9                   RIZAL STREET, ZONE 7 | BANGUED
10                         AP-APAYA | VILLAVICIOSA
12                     BARANGAY POBLACION | PUDTOL
13       ABASAG ST., POBLACION | CALANASAN (BAYAG)
14    DUCRAO STREET, BARANGAY\nPOBLACION | KABUGAO
15                                 RIPANG | CONNER
16                            BRGY. QUIRINO | LUNA
17                                BAGUTONG | FLORA
18                       SAN CARLOS | STA. MARCELA
20                          SAYANGAN, PAOAY | ATOK
21               GOVERNOR PACK ROAD, | BAGUIO CITY
dtype: object

In [101]:
#last column: contact number
contact_no = df_drops[4]

contact_no

0                         TEL_NO
3                     0747527509
4                    09177997644
5                     0747525644
6     09353388157 /\n09677608595
7                    09273096506
8                    09778268084
9                     0747255595
10                   09168911087
12                   09161292808
13                   09353581439
14                   09457715427
15                   09952093932
16                    0746340074
17                   09287301508
18                   09177703602
20                   09209507875
21                    0746617932
Name: 4, dtype: object

In [102]:
fin = pd.DataFrame({
    'Hospital Name': name,
    'Location': location,
    'Contact Number/s.': contact_no
})
# removing row 0 because it is unnecessary
fin = fin.drop(0)
#since the dataframe will start at index 3, we reset the row indices

fin.reset_index(drop=True, inplace=True)



In [103]:
fin

Unnamed: 0,Hospital Name,Location,Contact Number/s.
0,ABRA PROVINCIAL HOSPITAL,"CAPITULACION STREET, CALABA | BANGUED",0747527509
1,ASSUMPTA FAMILY HOSPITAL,"MAGALLANES ST., ZONE 5 | BANGUED",09177997644
2,BANGUED CHRISTIAN HOSPITAL,"TORRIJOS STREET, ZONE 5 | BANGUED",0747525644
3,DR. PETRONILO V. SEARES SR.\nMEMORIAL HOSPITAL,"PEÑARRUBIA ST., ZONE 4 | BANGUED",09353388157 /\n09677608595
4,LA PAZ DISTRICT HOSPITAL,POBLACION | LA PAZ,09273096506
5,ST. THERESA WELLNESS CENTER,BARANGAY TALOGTOG | DOLORES,09778268084
6,VALERA MEDICAL HOSPITAL,"RIZAL STREET, ZONE 7 | BANGUED",0747255595
7,VILLAVICIOSA MEDICARE AND\nCOMMUNITY HOSPITAL,AP-APAYA | VILLAVICIOSA,09168911087
8,AMMA JADSAC DISTRICT HOSPITAL,BARANGAY POBLACION | PUDTOL,09161292808
9,APAYAO DISTRICT HOSPITAL,"ABASAG ST., POBLACION | CALANASAN (BAYAG)",09353581439
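
The table above still contains line breaks (`\n`) that pdfplumber carried over from wrapped cells. One further preprocessing step, not part of the original submission, could replace them with spaces. The sketch below works on a made-up two-row frame with the same column names as `fin`.

```python
import pandas as pd

# Made-up rows with the same columns as `fin`.
toy_fin = pd.DataFrame({
    "Hospital Name": ["HOSPITAL A\nMEMORIAL", "HOSPITAL B"],
    "Location": ["MAIN STREET,\nPOBLACION | TOWN A", "SIDE ST. | TOWN B"],
    "Contact Number/s.": ["09170000001 /\n09170000002", "0747000000"],
})

# Replace each embedded line break with a single space in every text column.
cleaned = toy_fin.apply(
    lambda col: col.str.replace("\n", " ", regex=False).str.strip()
)
print(cleaned)
```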


In [104]:
fin.to_csv('hospitals.csv',index=False)
print("Completed")

Completed
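
The second bullet of the exercise asks for hospitals.csv to be loaded back into a DataFrame. One detail worth watching: if every contact number in a file were purely numeric, `read_csv` would infer integers and drop the leading zeros. Reading with `dtype=str` avoids depending on that. A minimal sketch, using an in-memory CSV with made-up rows in place of the real file:

```python
import io

import pandas as pd

# Made-up CSV text standing in for hospitals.csv.
csv_text = (
    "Hospital Name,Contact Number/s.\n"
    "HOSPITAL A,0747000000\n"
    "HOSPITAL B,09170000001\n"
)

inferred = pd.read_csv(io.StringIO(csv_text))
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(inferred["Contact Number/s."].tolist())  # parsed as integers, zeros lost
print(as_text["Contact Number/s."].tolist())   # kept as the original strings
```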


## 7.2 Conclusion

I learned how to use pandas more extensively to build my own DataFrame and to carry out basic data cleaning. As someone