<a href="https://colab.research.google.com/github/Kussil/CVX_Rice_project/blob/main/Text_Data/Text_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Import Libraries and Clone GitHub Repository

In [1]:
# Import Libraries
import os
from google.colab import userdata
import pandas as pd

In [2]:
# Retrieve the GitHub token from Colab secrets and clone the repository
GITHUB_TOKEN = userdata.get('github')
os.environ['GITHUB_TOKEN'] = GITHUB_TOKEN
!git clone https://{GITHUB_TOKEN}@github.com/Kussil/CVX_Rice_project.git

Cloning into 'CVX_Rice_project'...
remote: Enumerating objects: 162, done.
remote: Counting objects: 100% (100/100), done.
remote: Compressing objects: 100% (93/93), done.
remote: Total 162 (delta 52), reused 9 (delta 4), pack-reused 62
Receiving objects: 100% (162/162), 58.28 MiB | 7.71 MiB/s, done.
Resolving deltas: 100% (75/75), done.
Updating files: 100% (21/21), done.


## Load Investment Research Articles into a DataFrame

In [3]:
# Import Chevron and Other Majors
invest_df_cvx = pd.read_csv('CVX_Rice_project/Text_Data/Investment Research-CVX.csv')
invest_df_majors = pd.read_csv('CVX_Rice_project/Text_Data/Investment Research-Majors.csv')

# Append dfs
invest_df = pd.concat([invest_df_cvx, invest_df_majors], ignore_index=True)
display(invest_df_cvx.shape)
display(invest_df_majors.shape)
display(invest_df.shape)

# Rename, drop, and reorder columns
invest_df = invest_df.rename(columns={'Date/Time': 'Date', 'Company': 'Ticker', 'Headline': 'Article Headline', 'Text': 'Article Text'})
invest_df = invest_df.drop(['Contributor', 'Analyst', 'Pages'], axis=1)
new_order = ['Ticker', 'Date', 'Article Headline', 'Article Text']
invest_df = invest_df.reindex(columns=new_order)
display(invest_df.head())

# Note: roughly 400 articles from the majors and non-majors are still missing

(362, 7)

(2630, 7)

(2992, 7)

Unnamed: 0,Ticker,Date,Article Headline,Article Text
0,CVX,"May 13, 2024 10:05 PM",Chevron Corporation,"Stock Report | March 12, 2022 | NYSE Symbol: C..."
1,CVX,"May 13, 2024 03:29 PM",CFRA LIFTS VIEW ON SHARES OF CHEVRON CORPORATI...,"Stock Report | August 05, 2023 | NYSE Symbol: ..."
2,CVX,"May 11, 2024 06:00 PM",Chevron Corporation,"Stock Report | March 04, 2023 | NYSE Symbol: C..."
3,CVX,"May 04, 2024 05:49 PM",Chevron Corporation,"Stock Report | October 17, 2020 | NYSE Symbol:..."
4,CVX,"May 01, 2024 10:03 PM",Chevron Corporation,"Stock Report | December 25, 2021 | NYSE Symbol..."


## Load ProQuest News Articles into a DataFrame and Clean Data

In [4]:
# Import Chevron and Others
proquest_df_cvx = pd.read_csv('CVX_Rice_project/Text_Data/proquest_newsarticles_CVX.csv')
proquest_df_others = pd.read_csv('CVX_Rice_project/Text_Data/proquest_newsarticles_all_v2.csv')  # Previously loaded as empty; now returns rows (see the shape printed below)

# Append dfs
proquest_df = pd.concat([proquest_df_cvx, proquest_df_others], ignore_index=True)
display(proquest_df_cvx.shape)
display(proquest_df_others.shape)
display(proquest_df.shape)

# Rename, drop, and reorder columns
proquest_df = proquest_df.rename(columns={'Title': 'Article Headline', 'Full Article Text': 'Article Text'})
proquest_df = proquest_df.drop(['URL'], axis=1)
proquest_df['Ticker'] = "TBD" # Temp column until ticker added
proquest_df = proquest_df.reindex(columns=new_order)
display(proquest_df.head())

(261, 4)

(1500, 4)

(1761, 4)

Unnamed: 0,Ticker,Date,Article Headline,Article Text
0,TBD,"Dec 4, 2020",Chevron General Motors Zoom At Amp T Stocks That,Turn on search term navigationTurn on search t...
1,TBD,"Dec 4, 2020",Chevron Slash Capital Spending,Turn on search term navigationTurn on search t...
2,TBD,"Dec 3, 2020",Chevron Slashes Spending Plans As Coronavirus,Turn on search term navigationTurn on search t...
3,TBD,"Nov 11, 2020",Senators Air Misgivings Over Malampaya Chevron,Relevant content not found within the specifie...
4,TBD,"Oct 29, 2020",Shell Tries Woo Investors With Dividend Raise,Turn on search term navigationTurn on search t...


In [5]:
# Delete rows with missing article text
search_text_1 = 'Failed to load content: Message:'
search_text_2 = 'Relevant content not found within the specified range.'

# Flag rows whose article text contains either error message (literal match, not regex)
missing_mask = (proquest_df['Article Text'].str.contains(search_text_1, na=False, regex=False) |
                proquest_df['Article Text'].str.contains(search_text_2, na=False, regex=False))

# Count the flagged rows
count_rows = missing_mask.sum()

# Delete the flagged rows
proquest_df = proquest_df[~missing_mask]

# Print the number of rows with missing article text and the new shape of the DataFrame
print(f"Number of rows with missing article text: {count_rows}")
print(f"New DataFrame shape: {proquest_df.shape}")
print()

# Confirm data is good by looking for short article headlines
shortest_headline = proquest_df.loc[proquest_df['Article Headline'].str.len().idxmin(), 'Article Headline']
print(f"The shortest article headline is: '{shortest_headline}'")
print()

# Confirm data is good by looking for short article text
shortest_text = proquest_df.loc[proquest_df['Article Text'].str.len().idxmin(), 'Article Text']
print(f"The shortest article text is: '{shortest_text}'")

Number of rows with missing article text: 141
New DataFrame shape: (1620, 4)

The shortest article headline is: 'Lng Dash'

The shortest article text is: 'Turn on search term navigationTurn on search term navigation
| Jump to first hitOKLAHOMA CITY Gulfport Energy Corp. said production during the third quarter averaged 1.5 billion cubic feet equivalent per day, a 12% increase over the second quarter of 2019.For the third quarter, Gulfport’s net daily production mix was 93% natural gas, 5% natural gas liquids and 2% oil.Gulfport’s realized prices for the third quarter of 2019 were $1.73 per thousand cubic feet of natural gas, $78.59 per barrel of oil and $0.45 per gallon of natural gas liquids, resulting in a total equivalent price of $2.04 per Mcfe. Gulfport's realized prices for the third quarter of 2019 include an aggregate non-cash derivative loss of $54.1 million.'
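
The shortest article above still begins with the page-navigation text "Turn on search term navigation ... | Jump to first hit", which appears to come from the scraped page rather than from the article. A minimal sketch of how that prefix could be stripped is shown here. The pattern `NAV_PATTERN` and the helper `strip_navigation_prefix` are hypothetical names, and the pattern is based only on the examples visible in this notebook, so other variants of the prefix may exist.

```python
import pandas as pd

# Assumed pattern: one or more repeats of the navigation phrase,
# optional whitespace/newline, then "| Jump to first hit".
NAV_PATTERN = r'^(?:Turn on search term navigation)+\s*\|\s*Jump to first hit'


def strip_navigation_prefix(text_series):
    """Remove the leading navigation text from each article, if present."""
    return text_series.str.replace(NAV_PATTERN, '', regex=True).str.strip()


# Small illustrative sample (not the real data)
sample = pd.Series([
    'Turn on search term navigationTurn on search term navigation\n'
    '| Jump to first hitOKLAHOMA CITY Gulfport Energy Corp. said production rose.',
    'The war in Ukraine may hinder supply.',
])
cleaned = strip_navigation_prefix(sample)
print(cleaned.tolist())
```

In the notebook this could be applied as `proquest_df['Article Text'] = strip_navigation_prefix(proquest_df['Article Text'])`, ideally after spot-checking a few rows to confirm the pattern matches the real data.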


In [6]:
# Look for duplicate rows
duplicates = proquest_df[proquest_df.duplicated(subset=['Date', 'Article Headline'], keep=False)]
duplicate_count = duplicates.shape[0]
print(f"Number of duplicate rows: {duplicate_count}")
print(duplicates)
print()

# Drop duplicates and keep the first occurrence
proquest_df = proquest_df.drop_duplicates(subset=['Date', 'Article Headline'], keep='first')
display(proquest_df.shape)
display(proquest_df.head(20))
display(proquest_df.tail(20))

Number of duplicate rows: 244
     Ticker          Date                                 Article Headline  \
248     TBD   Oct 7, 2022    Markets Amp Finance Commodities Chevron Faces   
250     TBD   Oct 6, 2022    Chevron Faces Tough Job Restarting Venezuelas   
261     TBD   Nov 5, 2020  Us Oil Shares Climb As Renewables Sector Slides   
262     TBD   Nov 5, 2020  Us Oil Shares Climb As Renewables Sector Slides   
280     TBD  Oct 24, 2020        Trump Biden Play Rules Testy Final Debate   
...     ...           ...                                              ...   
1707    TBD  Mar 11, 2022                         Russian Gas Pipe Schemes   
1735    TBD   Mar 2, 2022                          Cheniere Lng Super Cool   
1736    TBD   Mar 2, 2022                          Cheniere Lng Super Cool   
1737    TBD   Mar 2, 2022                          Cheniere Lng Super Cool   
1738    TBD   Mar 2, 2022                          Cheniere Lng Super Cool   

                                 

(1459, 4)

Unnamed: 0,Ticker,Date,Article Headline,Article Text
0,TBD,"Dec 4, 2020",Chevron General Motors Zoom At Amp T Stocks That,Turn on search term navigationTurn on search t...
1,TBD,"Dec 4, 2020",Chevron Slash Capital Spending,Turn on search term navigationTurn on search t...
2,TBD,"Dec 3, 2020",Chevron Slashes Spending Plans As Coronavirus,Turn on search term navigationTurn on search t...
4,TBD,"Oct 29, 2020",Shell Tries Woo Investors With Dividend Raise,Turn on search term navigationTurn on search t...
14,TBD,"Oct 2, 2020",Virus Pain Persists Oil Companies,Turn on search term navigationTurn on search t...
15,TBD,"Oct 1, 2020",Pandemic Pain Persists Big Oil Companies Tepid,Turn on search term navigationTurn on search t...
17,TBD,"Aug 24, 2020",Norways Biggest Private Money Manager Exits Exxon,Turn on search term navigationTurn on search t...
18,TBD,"Aug 18, 2020",Chevron Pursues Iraq Oil Project Deals With,Turn on search term navigationTurn on search t...
19,TBD,"Aug 18, 2020",Chevron Pursues Exploration Deal Iraq Ge,Turn on search term navigationTurn on search t...
20,TBD,"Aug 17, 2020",Chevron Pursues Exploration Deal Iraq Ge,Turn on search term navigationTurn on search t...


Unnamed: 0,Ticker,Date,Article Headline,Article Text
1735,TBD,"Mar 2, 2022",Cheniere Lng Super Cool,"""Anybody who wants gas in 2021-22 had better s..."
1739,TBD,"Mar 2, 2022",War Threatens Gas Cheap Energy Supply Brazil,The war in Ukraine may hinder the opening of t...
1740,TBD,"Mar 2, 2022",United States Natural Gas Monthly Data December,"HighlightsDecember 2021In December 2021, dry n..."
1741,TBD,"Mar 1, 2022",Natural Gas Investments Hit 8 7Trn 2050,Natural gas can become the fuel of choice in s...
1742,TBD,"Mar 1, 2022",Lng Import At 32 Month Low On High Prices Rising,"New Delhi, February 28 STATES\nIndia’s liquefi..."
1743,TBD,"Feb 28, 2022",Europe Is Pivoting Away Russian Gas Why Cheniere,Russia's invasion of Ukraine has underscored m...
1744,TBD,"Feb 28, 2022",Energy Sanctions Are Weapon Putin Would,Sanctions against Vladimir Putin's war machine...
1747,TBD,"Feb 28, 2022",Why Europe Must Break Dependency On Russian Gas,Turn on search term navigationTurn on search t...
1748,TBD,"Feb 28, 2022",World Europes Reliance On Russian Fossil Fuels,Germany's vice chancellor is calling Russia's ...
1749,TBD,"Feb 27, 2022",Russia Sends Natural Gas Tankers Kaliningrad,Russia holds most of the cards when it comes t...
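
The duplicate count printed above (244) is larger than the number of rows actually removed (1620 - 1459 = 161) because `duplicated(keep=False)` flags every row in a duplicate group, while `drop_duplicates(keep='first')` removes only the extra copies. The toy DataFrame below is made up for illustration and simply shows the two counts side by side.

```python
import pandas as pd

# Toy data: one headline appears three times, another twice, one is unique
toy = pd.DataFrame({
    'Date': ['Mar 2, 2022', 'Mar 2, 2022', 'Mar 2, 2022',
             'Mar 1, 2022', 'Nov 5, 2020', 'Nov 5, 2020'],
    'Article Headline': ['Cheniere Lng Super Cool'] * 3
                        + ['Natural Gas Investments Hit 8 7Trn 2050']
                        + ['Us Oil Shares Climb As Renewables Sector Slides'] * 2,
})
key = ['Date', 'Article Headline']

# Every row that belongs to a duplicate group (what the cell above prints)
rows_in_duplicate_groups = int(toy.duplicated(subset=key, keep=False).sum())

# Only the extra copies, i.e. the rows that drop_duplicates removes
rows_removed = int(toy.duplicated(subset=key, keep='first').sum())

deduplicated = toy.drop_duplicates(subset=key, keep='first')
print(rows_in_duplicate_groups, rows_removed, len(deduplicated))
```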


## Concatenate Both Sources and Export to CSV

In [7]:
# Concatenate
text_df = pd.concat([invest_df, proquest_df], ignore_index=True)
display(text_df.shape)
display(text_df.head())

(4451, 4)

Unnamed: 0,Ticker,Date,Article Headline,Article Text
0,CVX,"May 13, 2024 10:05 PM",Chevron Corporation,"Stock Report | March 12, 2022 | NYSE Symbol: C..."
1,CVX,"May 13, 2024 03:29 PM",CFRA LIFTS VIEW ON SHARES OF CHEVRON CORPORATI...,"Stock Report | August 05, 2023 | NYSE Symbol: ..."
2,CVX,"May 11, 2024 06:00 PM",Chevron Corporation,"Stock Report | March 04, 2023 | NYSE Symbol: C..."
3,CVX,"May 04, 2024 05:49 PM",Chevron Corporation,"Stock Report | October 17, 2020 | NYSE Symbol:..."
4,CVX,"May 01, 2024 10:03 PM",Chevron Corporation,"Stock Report | December 25, 2021 | NYSE Symbol..."
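
After concatenation the `Date` column mixes two string formats: the investment research rows carry a time (for example "May 13, 2024 10:05 PM"), while the ProQuest rows carry only a date (for example "Dec 4, 2020"). The sketch below shows one possible way to convert both to timestamps by parsing each value individually. The helper `parse_mixed_dates` is a hypothetical name, and element-wise parsing is assumed to be fast enough for a dataset of this size.

```python
import pandas as pd


def parse_mixed_dates(date_series):
    """Parse each value on its own so differently formatted strings can coexist.

    Values that cannot be parsed become NaT instead of raising an error.
    """
    return date_series.apply(lambda value: pd.to_datetime(value, errors='coerce'))


# Small illustrative sample using the two formats seen in this notebook
sample_dates = pd.Series(['May 13, 2024 10:05 PM', 'Dec 4, 2020', 'not a date'])
parsed = parse_mixed_dates(sample_dates)
print(parsed)
```

In the notebook this could be applied as `text_df['Date'] = parse_mixed_dates(text_df['Date'])`, followed by a check of how many values came back as `NaT`.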


In [8]:
# Export as CSV
text_df.to_csv('/content/Consolidated_Text_Data.csv', index=False)

# Note: this export currently has to be uploaded to GitHub manually; automating the upload is a future improvement
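
One possible way to automate the upload is to write the CSV inside the cloned repository and push it with git. The sketch below only builds the commands by default (`dry_run=True`). The helper names and the destination path `Text_Data/Consolidated_Text_Data.csv` are assumptions, not part of the original notebook. Actually pushing would also require git `user.name` and `user.email` to be configured in the Colab session and a token with write access to the repository.

```python
import subprocess


def build_push_commands(repo_dir, file_in_repo, message):
    """Return the git commands (as argument lists) that stage, commit, and push one file."""
    return [
        ["git", "-C", str(repo_dir), "add", str(file_in_repo)],
        ["git", "-C", str(repo_dir), "commit", "-m", message],
        ["git", "-C", str(repo_dir), "push"],
    ]


def push_file(repo_dir, file_in_repo, message, dry_run=True):
    """Run the git commands, or only return them when dry_run is True."""
    commands = build_push_commands(repo_dir, file_in_repo, message)
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if a git step fails
            subprocess.run(cmd, check=True)
    return commands


# Dry run: nothing is executed, the commands are only printed
commands = push_file(
    "CVX_Rice_project",
    "Text_Data/Consolidated_Text_Data.csv",  # hypothetical destination inside the repo
    "Update consolidated text data",
)
for cmd in commands:
    print(" ".join(cmd))
```

For this to work, the export cell would need to write to `CVX_Rice_project/Text_Data/Consolidated_Text_Data.csv` instead of `/content/`, so that the file lives inside the repository's working tree.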