# Jupyter Notebook: Patent Trends

## Cleaning Web-Scraped Data from VentureBeat (https://venturebeat.com/)
**This notebook imports the CSV file produced by the Selenium web-scraping script (venturebeat_script.py) for preprocessing.** <br>
Datasets needed: venturebeat_results2.csv <br>
Methods: <br>
1.  Filter out illogical data <br>
    * User-defined function to identify invalid observations and validate proper ones  <br>
    * Regex functions to extract and keep only proper observations   <br>
2.  Extract and transform funding amounts found in the website headers  <br>
    * Regex functions to extract money: currency characters, abbreviations, and numbers  <br>
    * Drop observations with no funding  <br>
    * User-defined functions to transform funding into numeric values  <br>
3.  Leverage NLP to extract main topics from website abstracts <br>
    * spaCy with PyTextRank used to find the top 3 topics <br>


In [1]:
import pandas as pd
import numpy as np 
import pytextrank
import spacy
import re
from datetime import datetime
import locale 
from decimal import Decimal
import warnings

warnings.filterwarnings('ignore')
# locale.setlocale(locale.LC_ALL, '')

In [2]:
df = pd.read_csv('venturebeat_results2.csv')
df = df.astype(str)

In [3]:
print('Shape of the VentureBeat dataframe is: {}'.format(df.shape))

Shape of the VentureBeat dataframe is: (2676, 4)


In [4]:
df.head(5)

| | Date | Header | SubHeader | Abstract |
|---|---|---|---|---|
| 0 | 12/5/2021 | Sense raises $50M to bolster recruitment effor... | Hear from CIOs, CTOs, and other C-level and se... | Recruiting is a top concern for enterprises in... |
| 1 | 12/2/2021 | AI-powered ecommerce platform Convious raises ... | Hear from CIOs, CTOs, and other C-level and se... | Convious , the Amsterdam-based company that of... |
| 2 | 12/2/2021 | Replai uses computer vision and data analysis ... | Hear from CIOs, CTOs, and other C-level and se... | Replai automates analysis of video ad effec... |
| 3 | 12/2/2021 | Smartling lands $160M to help companies transl... | Hear from CIOs, CTOs, and other C-level and se... | As the pandemic drives businesses online, tran... |
| 4 | 12/2/2021 | Digital Insulin Management Company Hygieia Clo... | | LIVONIA, Mich.–(BUSINESS WIRE)–December 2, 202... |


In [5]:
df.dtypes

Date         object
Header       object
SubHeader    object
Abstract     object
dtype: object

##### Some scraped rows were formatted differently, so text strings ended up in the Date column
* The function is_date takes a df, identifies which rows hold dates, and returns the positional indices of the rows that do not
* It also prints how many rows are dates and how many are not (the latter will be dropped)
* *Requirement: the date column must be the first column in the df* 

In [6]:
def is_date(data):
    """Print how many rows of the first column look like dates and
    return the positions of the rows that do not."""
    target = data.iloc[:, 0].astype(str).tolist()
    to_drop = []
    to_keep = []
    for i, value in enumerate(target):
        # A value counts as a date if it contains a '2' followed by
        # three digits, i.e. a year such as 2021
        if re.match(r'.*([2][0-9]{3})', value) is None:
            to_drop.append(i)
        else:
            to_keep.append(i)
    print('{} items were not dates'.format(len(to_drop)))
    print('{} items were dates'.format(len(to_keep)))
    return to_drop


In [7]:
is_date(df)

4 items were not dates
2672 items were dates


[110, 262, 356, 2603]
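
The regex in is_date only looks for a `2` followed by three digits, so a stray headline that mentions a year would still pass. A stricter check could try to parse each value instead. The sketch below is only an illustration: it assumes every valid date follows the month/day/year layout seen in `df.head()`, and the helper name `invalid_date_positions` and the sample strings are made up for this example.

```python
from datetime import datetime


def invalid_date_positions(values, fmt="%m/%d/%Y"):
    """Return the positions of entries that do not parse as dates in `fmt`."""
    bad = []
    for pos, value in enumerate(values):
        try:
            datetime.strptime(str(value).strip(), fmt)
        except ValueError:
            # Anything that is not a month/day/year date lands here
            bad.append(pos)
    return bad


# Hypothetical sample: the second entry contains a year, so the regex
# check would accept it, but it is not a date
sample = ["12/5/2021", "Startup raises funds in 2021", "12/2/2021", "nan"]
print(invalid_date_positions(sample))
```

If the scraped dates turn out to use more than one layout, this check would reject the extra layouts too, so the format string would need to be adapted.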

##### Filter out rows where the data was scraped incorrectly
##### df2 = df with invalid dates dropped

In [8]:
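# Pattern without a capture group avoids the pandas match-group warning;
# .copy() makes df2 an independent frame so later cells can add columns to it
df2 = df[df['Date'].str.contains(r'2[0-9]{3}')].copy()
df2.reset_index(drop=True, inplace=True)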

##### Confirm all invalid 'Dates' have been dropped

In [9]:
is_date(df2)

0 items were not dates
2672 items were dates


[]

##### Search the 'Header' column and extract monetary values 
##### df2 gains an additional column, Amount_Funded, holding the amounts extracted from the header

In [10]:
df2['Amount_Funded'] = df2.apply(lambda row: re.findall(r"[$]\d+\s*\.*\-*\d*\s*\w*", row['Header']), axis=1)
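
To see what this pattern captures, here is a small sketch that applies the same regex to a few headers. The headers are invented for illustration and are not rows from the dataset; `extract_amounts` is a hypothetical helper name.

```python
import re

# Same pattern as in the cell above: a dollar sign, digits, an optional
# decimal or range part, then an optional word suffix
MONEY = r"[$]\d+\s*\.*\-*\d*\s*\w*"


def extract_amounts(header):
    """Return every money-like substring found in a header."""
    return re.findall(MONEY, header)


# Made-up headers covering the main shapes
headers = [
    "Acme raises $50M to grow",
    "Example Corp lands $12.5 million",
    "Foo closes $10-15 million round",
    "No dollar figure here",
    "Bar raises $5M seed, $20M Series A",
]
for h in headers:
    print(extract_amounts(h))
```

Headers with no match return an empty list, and headers that mention two amounts return two matches.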

##### Drop rows where no single funding amount was extracted from the header (no match, or more than one match)

In [11]:
empty_funding = []
valid_funding = []
for i in range(len(df2)):
    if len(df2.Amount_Funded[i]) == 1:
        valid_funding.append(i)
    else:
        empty_funding.append(i)
print('{} rows do not have funding info in the header. These will be dropped' 
      '\n{} rows remaining with funding info'.format(len(empty_funding),len(valid_funding)))

1021 rows do not have funding info in the header. These will be dropped
1651 rows remaining with funding info


##### df3 is a new df with the rows lacking funding info dropped 

In [12]:
# .copy() makes df3 an independent frame so later cells can modify it safely
df3 = df2.iloc[valid_funding].copy()
df3.reset_index(drop=True, inplace=True)

In [13]:
df3['Amount_Funded'] = df3['Amount_Funded'].str[0]

##### Money values extracted from the Header come in multiple forms
* We need a function that returns numeric values from text strings with varying syntaxes 

##### The cell below identifies every type of suffix we need to handle 

In [14]:
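amounts = df3['Amount_Funded'].tolist()

# Matches the numeric part of an amount, e.g. '$12.5 ' or '$10-15 '
# (raw string so the regex escapes are not treated as string escapes)
pattern1 = r'[$]\d+\.*\d*\ *\-*\ *\d*'
replace = ' ' 

# Replace the numeric part with a space so that only the suffix remains
suffix = [re.sub(pattern1, replace, item) for item in amounts]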

In [15]:
df_test_suffix = pd.DataFrame()
df_test_suffix['Suffix'] = suffix 
df_test_suffix.value_counts().to_frame()

| Suffix | Count |
|---|---|
| million | 1127 |
| M | 411 |
| Million | 66 |
| billion | 27 |
| B | 10 |
| (blank) | 4 |
| K | 2 |
| MM | 1 |
| kit | 1 |
| m | 1 |
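
The suffix tally can be reproduced outside pandas. The sketch below runs the same numeric-part pattern over a handful of invented amount strings and counts what is left; `suffix_counts` is a hypothetical helper, and the sample values are not taken from the dataset.

```python
import re
from collections import Counter

# Same numeric-part pattern as in the suffix cell
PATTERN = r'[$]\d+\.*\d*\ *\-*\ *\d*'


def suffix_counts(amounts):
    """Count the text left once the numeric part of each amount is removed."""
    return Counter(re.sub(PATTERN, ' ', a).strip() for a in amounts)


# Made-up amounts; '$100' has no suffix and ends up as an empty string
sample = ['$50M', '$12.5 million', '$160M', '$1.2 billion', '$5 kit', '$100']
counts = suffix_counts(sample)
print(counts)
```

Stripping whitespace before counting folds the blank and single-space cases into one empty-string bucket.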


##### Spaces, blanks, 'MM', 'kit', and 'per' go in a separate bucket, labeled other, whose multiplier is 0 

##### Now that we know all possible syntax variations, create a function that takes any of them and returns a numeric value 

In [16]:
def dolladollabills(x):
    """Convert funding strings such as '$12.5 million' into Decimal dollar amounts."""
    amounts = x.tolist()

    thousand = ['k', 'K', 'Thousand', 'thousand']
    million = ['m', 'M', 'Million', 'million']
    billion = ['b', 'B', 'Billion', 'billion']
    other = [' ', '', 'MM', 'kit', 'per']

    K = 1000
    M = 1000000
    B = 1000000000

    # Map every known suffix to its multiplier; 'other' suffixes map to 0
    multipliers = {
        **dict.fromkeys(thousand, K),
        **dict.fromkeys(million, M),
        **dict.fromkeys(billion, B),
        **dict.fromkeys(other, 0),
    }

    pattern1 = r'[$]\d+\.*\d*\ *\-*\ *\d*'  # numeric part, including ranges such as '$10-15'
    replace = ' '
    pattern2 = r'\d+\.*\d*'                 # first number in the string

    converted_funds = []
    for item in amounts:
        number = float(re.findall(pattern2, item)[0])
        suffix = re.sub(pattern1, replace, item).strip()
        # Unrecognized suffixes fall back to 0, so the output always has
        # exactly one value per input row
        converted_funds.append(Decimal(number * multipliers.get(suffix, 0)))
    return converted_funds


In [17]:
df3['Clean_Funding'] = dolladollabills(df3['Amount_Funded']) 
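
The conversion above multiplies a float before wrapping the result in `Decimal`, so any binary rounding in the float carries over. One possible alternative, sketched here under assumptions, is to parse the number text directly into `Decimal`. The helper `to_decimal_amount` and its `MULTIPLIERS` table are hypothetical and differ from the notebook's function in that suffixes are matched case-insensitively.

```python
import re
from decimal import Decimal

# Hypothetical multiplier table; anything not listed (e.g. 'MM', 'kit') scales to 0
MULTIPLIERS = {
    'k': 1_000, 'thousand': 1_000,
    'm': 1_000_000, 'million': 1_000_000,
    'b': 1_000_000_000, 'billion': 1_000_000_000,
}


def to_decimal_amount(text):
    """Parse e.g. '$12.5 million' into an exact Decimal; unknown suffixes give 0."""
    number = re.search(r'\d+(?:\.\d+)?', text)   # first number in the string
    suffix = re.search(r'([A-Za-z]+)\s*$', text)  # trailing word, if any
    if number is None:
        return Decimal(0)
    scale = MULTIPLIERS.get(suffix.group(1).lower(), 0) if suffix else 0
    # Decimal(str) keeps the digits exactly as written in the header
    return Decimal(number.group()) * scale


for s in ['$50M', '$12.5 million', '$10-15 million', '$5 kit', '$100']:
    print(s, '->', to_decimal_amount(s))
```

As in the notebook's function, a range such as `$10-15 million` is converted using its first number only.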

##### Change the dtype of the Date column to datetime

In [18]:
df3['Date'] = pd.to_datetime(df3['Date'])

In [19]:
# .copy() makes df4 an independent frame so the Topics column can be added later
df4 = df3[['Date','Header', 'Abstract', 'Clean_Funding']].copy()

In [20]:
df4.head()

| | Date | Header | Abstract | Clean_Funding |
|---|---|---|---|---|
| 0 | 2021-12-05 | Sense raises $50M to bolster recruitment effor... | Recruiting is a top concern for enterprises in... | 50000000 |
| 1 | 2021-12-02 | AI-powered ecommerce platform Convious raises ... | Convious , the Amsterdam-based company that of... | 12000000 |
| 2 | 2021-12-02 | Smartling lands $160M to help companies transl... | As the pandemic drives businesses online, tran... | 160000000 |
| 3 | 2021-12-02 | Digital Insulin Management Company Hygieia Clo... | LIVONIA, Mich.–(BUSINESS WIRE)–December 2, 202... | 17000000 |
| 4 | 2021-12-01 | CyCognito nabs $100M to fight cyberattacks wit... | CyCognito , a company developing bot technolog... | 100000000 |


##### Create a function to extract the top 3 topics from the Abstract using PyTextRank in a spaCy pipeline 

In [21]:
# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")

# add PyTextRank to the spaCy pipeline
nlp.add_pipe("textrank")


def top_topics(text, max_items=3):
    # run the pipeline (including TextRank) on the text
    doc = nlp(str(text))

    # phrases come ranked, highest first; keep at most max_items of them
    # and stop at the first phrase whose rank is not positive
    top_ranked = []
    for phrase in doc._.phrases[:max_items]:
        if phrase.rank <= 0:
            break
        top_ranked.append(phrase.text)

    return top_ranked

In [22]:
df4['Topics'] = df4['Abstract'].apply(top_topics)

In [23]:
locale.setlocale(locale.LC_ALL, '')
print('Total Amount in funding: {}'.format(locale.currency(df4.Clean_Funding.sum(), grouping=True)))
print('Data ranges from {} to {}'.format(df4.Date.min().date(), df4.Date.max().date()))

Total Amount in funding: $715,466,725,000.00
Data ranges from 2009-12-21 to 2021-12-05
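
`locale.currency` depends on the locale configured on the machine running the notebook, and it raises an error under the plain `C` locale. If the dollar format is all that is needed, a locale-independent sketch could use Python's format specification instead. The helper name `format_usd` is hypothetical; the amount is the total printed above.

```python
from decimal import Decimal


def format_usd(amount):
    """Format a number as US dollars with thousands separators and two decimals."""
    return "${:,.2f}".format(amount)


print(format_usd(Decimal('715466725000')))  # $715,466,725,000.00
```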


In [24]:
df4.reset_index(drop = False, inplace=True)

In [25]:
df5 = df4[['index','Header']]

In [26]:
df5

| | index | Header |
|---|---|---|
| 0 | 0 | Sense raises $50M to bolster recruitment effor... |
| 1 | 1 | AI-powered ecommerce platform Convious raises ... |
| 2 | 2 | Smartling lands $160M to help companies transl... |
| 3 | 3 | Digital Insulin Management Company Hygieia Clo... |
| 4 | 4 | CyCognito nabs $100M to fight cyberattacks wit... |
| ... | ... | ... |
| 1646 | 1646 | A.I. research firm Vicarious raises $15M in it... |
| 1647 | 1647 | Online ad player Rocket Fuel gains $50M fundin... |
| 1648 | 1648 | Ad optimizer Rocket Fuel lifts off with $10M f... |
| 1649 | 1649 | Revised video game financing list: 115 game co... |


In [27]:
#df5.to_csv('VentureBeat-Processed.csv')