# Challenges:
* links
* News
* Encode companies
* Encode people
* Use text embeddings (word/document) or word frequencies
* Remove unneeded columns (e.g. Image, Deal announced on)
* Imputation
* Twitter link: get followers and latest post?
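
As a starting point for the word-frequency idea in the list above, here is a minimal sketch. It runs on a toy frame rather than the real files, and the helper name `word_frequencies` and the simple alphabetic tokenizer are assumptions made for illustration, not part of the dataset or of any final pipeline.

```python
import pandas as pd

# Toy stand-in for a free-text column such as Tagline or Description.
toy = pd.DataFrame({
    "Tagline": [
        "Day Software develops web applications",
        "Cloud software for web teams",
        None,  # the real text columns also contain missing values
    ]
})

def word_frequencies(text_col: pd.Series) -> pd.Series:
    """Count lowercase alphabetic tokens across a text column (hypothetical helper)."""
    tokens = (
        text_col.dropna()           # skip missing text
        .str.lower()                # case-insensitive counting
        .str.findall(r"[a-z]+")     # crude tokenizer: runs of ASCII letters
        .explode()                  # one row per token
    )
    return tokens.value_counts()

freqs = word_frequencies(toy["Tagline"])
print(freqs.head())
```

A real run would probably also need stop-word removal, since words like "for" would otherwise dominate the counts.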

# Notes on data
* The Address column is redundant: there are already separate city, state and country columns
* CrunchBase links are protected by CAPTCHA and can't be scraped

Imports

In [16]:
import pandas as pd
import numpy as np
import requests

Reviewing a sample row from each file

In [17]:
acquired = pd.read_csv("Data/Acquired Tech Companies.csv")
acquired.iloc[0]

Company                                                     Day Software
CrunchBase Profile     http://www.crunchbase.com/organization/day-sof...
Image                  http://a5.images.crunchbase.com/image/upload/c...
Tagline                Day Software develops web applications that al...
Year Founded                                                         NaN
Market Categories                                               Software
Address (HQ)           Barfüsserplatz 6, Basel, Basel-Stadt, Switzerland
City (HQ)                                                          Basel
State / Region (HQ)                                          Basel-Stadt
Country (HQ)                                                 Switzerland
Description            Day was founded in Basel, Switzerland, in 1993...
Homepage                                              http://www.day.com
Twitter                                                              NaN
Acquired by                                        

In [18]:
acquiring = pd.read_csv("Data/Acquiring Tech Companies.csv")
acquiring.iloc[0]

Acquiring Company                                                                        Adobe
CrunchBase Profile                               www.crunchbase.com/organization/adobe-systems
Image                                        http://a2.images.crunchbase.com/image/upload/c...
Tagline                                      Adobe is an American multinational computer so...
Market Categories                            Photo Editing, Design, Creative, Software, Ima...
Year Founded                                                                              1982
IPO                                                                                       1986
Founders                                                         John Warnock, Charles Geschke
Number of Employees                                                                     11,144
Number of Employees (year of last update)                                               2012.0
Total Funding ($)                                 

In [19]:
acquisitions = pd.read_csv("Data/Acquisitions.csv")
acquisitions.iloc[0]

Acquisitions ID                                      EMC acquired Data Domain in 2009
Acquired Company                                                          Data Domain
Acquiring Company                                                                 EMC
Year of acquisition announcement                                                 2009
Deal announced on                                                           8/07/2009
Price                                                                  $2,100,000,000
Status                                                                    Undisclosed
Terms                                                                            Cash
Acquisition Profile                 http://www.crunchbase.com/acquisition/5dc676a1...
News                                                         EMC acquired Data Domain
News Link                           http://www.businesswire.com/news/home/20090708...
Name: 0, dtype: object

In [20]:
founders = pd.read_csv("Data/Founders and Board Members.csv")
founders.iloc[0]

Name                                                 Hans-Werner Hector
CrunchBase Profile      http://de.wikipedia.org/wiki/Hans-Werner_Hector
Role                                                            Founder
Companies                                                           SAP
Image                 http://images.forbes.com/media/lists/10/2006/4...
Name: 0, dtype: object

We will link the files using these columns:
* 'Acquisitions ID' to link each acquisition to its company records
* 'Founders' (in the acquiring companies file) and 'Name' (in the founders file) to link companies to their founders
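
The linking plan above could be sketched roughly as follows. The frames here are tiny stand-ins with illustrative values (not rows verified against the real files), and the `_toy` names and the intermediate `Founder` column are hypothetical. The sketch assumes that 'Founders' holds a comma-separated string, as in the Adobe sample row.

```python
import pandas as pd

# Toy rows shaped like the real files; values are illustrative only.
acquisitions_toy = pd.DataFrame({
    "Acquisitions ID": ["EMC acquired Data Domain in 2009"],
    "Acquiring Company": ["EMC"],
})
acquired_toy = pd.DataFrame({
    "Company": ["Data Domain"],
    "Acquisitions ID": ["EMC acquired Data Domain in 2009"],
})
acquiring_toy = pd.DataFrame({
    "Acquiring Company": ["Adobe"],
    "Founders": ["John Warnock, Charles Geschke"],
})
founders_toy = pd.DataFrame({
    "Name": ["John Warnock", "Charles Geschke"],
    "Role": ["Founder", "Founder"],
})

# 1) Link acquisitions to acquired companies through the shared ID column.
deals = acquisitions_toy.merge(acquired_toy, on="Acquisitions ID", how="left")

# 2) Split the comma-separated Founders string into one row per person,
#    strip the whitespace left by the split, then match against Name.
founder_rows = acquiring_toy.assign(
    Founder=acquiring_toy["Founders"].str.split(",")
).explode("Founder", ignore_index=True)
founder_rows["Founder"] = founder_rows["Founder"].str.strip()

people = founder_rows.merge(
    founders_toy, left_on="Founder", right_on="Name", how="left"
)
print(deals[["Acquisitions ID", "Company"]])
print(people[["Acquiring Company", "Name", "Role"]])
```

Using `how="left"` keeps rows that find no match, which makes unmatched IDs or names easy to spot as NaN values afterwards.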

In [6]:
np.intersect1d(acquired.columns, acquisitions.columns).tolist()

['Acquisitions ID']

In [7]:
np.intersect1d(acquiring.columns, acquisitions.columns).tolist()

['Acquiring Company', 'Acquisitions ID']

In [8]:
acquired.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 310 entries, 0 to 309
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Company              310 non-null    object 
 1   CrunchBase Profile   310 non-null    object 
 2   Image                287 non-null    object 
 3   Tagline              307 non-null    object 
 4   Year Founded         241 non-null    float64
 5   Market Categories    287 non-null    object 
 6   Address (HQ)         277 non-null    object 
 7   City (HQ)            275 non-null    object 
 8   State / Region (HQ)  273 non-null    object 
 9   Country (HQ)         276 non-null    object 
 10  Description          299 non-null    object 
 11  Homepage             273 non-null    object 
 12  Twitter              168 non-null    object 
 13  Acquired by          309 non-null    object 
 14  Acquisitions ID      310 non-null    object 
 15  API                  310 non-null    obj
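
The `info()` output shows several columns with gaps (Twitter, for example, has 168 non-null values out of 310), which is what the imputation item in the challenges refers to. A small sketch for ranking columns by their share of missing values is shown here on toy data; `missing_summary` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps, standing in for a frame like `acquired`.
toy = pd.DataFrame({
    "Company": ["A", "B", "C", "D"],
    "Year Founded": [1993.0, np.nan, 2001.0, np.nan],
    "Twitter": [np.nan, np.nan, np.nan, "http://twitter.com/d"],
})

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing count and share per column, worst first (hypothetical helper)."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "missing": n_missing,
        "share": n_missing / len(df),
    })
    # Keep only columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("share", ascending=False)

summary = missing_summary(toy)
print(summary)
```

Columns with a very high missing share may be better dropped than imputed, but that threshold is a judgment call.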

In [9]:
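def ValidateLink(url, timeout=15):
    """Return True if a GET request to `url` answers with a status code below 400."""
    session = requests.Session()
    # Browser-like headers so the request is less likely to be rejected as a bot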
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'DNT': '1',
    }
    session.headers.update(headers) 
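    try:
        # stream=True avoids downloading the body; only the status code is needed
        response = session.get(url, timeout=timeout, allow_redirects=True, stream=True)
        status_code = response.status_code
        response.close()
        return status_code < 400
    except Exception:
        # Any failure (malformed URL, timeout, connection error) counts as invalid
        return False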

In [None]:
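def ValidateLinks(urls):
    # Check URLs in order and stop at the first valid one: a single hit is
    # enough to show the column is reachable, so the remaining URLs are skipped.
    results = []
    for url in urls:
        results.append(ValidateLink(url))
        if results[-1]:
            break
    return results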

In [None]:
def ValidateLinksDF(df):
    for col in df.columns:
        for val in df[col]:
            # Treat the column as a link column once any string value contains 'http'
            if isinstance(val, str) and 'http' in val:
                print(col)
                results = ValidateLinks(df[col])
                if not pd.Series(results).any():
                    print(f'Column "{col}" had no valid links , or is using captcha.')
                    print('Try it yourself:')
                    # Print the link that was found; df[col][0] fails if the first row is NaN
                    print(val + '\n')
                break

In [12]:
ValidateLinksDF(acquired)

CrunchBase Profile
Column "CrunchBase Profile" had no valid links , or is using captcha.
Try it yourself:
http://www.crunchbase.com/organization/day-software

Image
Column "Image" had no valid links , or is using captcha.
Try it yourself:
http://a5.images.crunchbase.com/image/upload/c_pad,h_500,w_500/v1397187256/0ef62e4243274cafe1a563dda2bca363.png

Homepage
Twitter
API


* CrunchBase is using CAPTCHA, so we won't drop the column now; we will process it later
* None of the Image links resolve, so we will drop the column
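
The message printed above cannot tell a CAPTCHA-protected site from a dead link, because `ValidateLink` only returns True or False. One possible refinement is to classify the status code instead. The sketch below is a hypothetical helper, and it assumes that blocked requests come back as 401, 403 or 429, which is common for bot protection but not guaranteed for any particular site.

```python
from typing import Optional

def classify_status(status_code: Optional[int]) -> str:
    """Map an HTTP status code to a coarse label (hypothetical helper).

    Assumption: 401/403/429 signal that the server refused or throttled
    the request (e.g. bot protection), not that the page is gone.
    """
    if status_code is None:
        return "unreachable"   # no response at all (timeout, DNS failure, ...)
    if status_code < 400:
        return "ok"
    if status_code in (401, 403, 429):
        return "blocked"
    return "broken"            # e.g. 404 or a 5xx server error

for code in (200, 403, 404, None):
    print(code, classify_status(code))
```

With labels like these, a column whose links are mostly "blocked" could be kept for later processing, while a column that is mostly "broken" or "unreachable" is a candidate for dropping.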

In [21]:
acquired=acquired.drop('Image',axis=1)

In [13]:
ValidateLinksDF(acquiring)

Image
Column "Image" had no valid links , or is using captcha.
Try it yourself:
http://a2.images.crunchbase.com/image/upload/c_pad,h_500,w_500/v1397180657/2cd912e176145af3618549d60b7959a1.png

Homepage
Twitter
API


* Drop the Image column here as well

In [23]:
acquiring=acquiring.drop('Image', axis=1)
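
Note that `CrunchBase Profile` does not appear in the output for `acquiring`. This is consistent with the sample row shown earlier, where the link is written as `www.crunchbase.com/...` without an `http` prefix, so the `'http' in val` check never matches. If those links need to be validated too, one option is to add a scheme first. The helper below is a hypothetical sketch, and `http` as the default scheme is an assumption.

```python
def ensure_scheme(url, default_scheme="http"):
    """Prefix a scheme when a URL string lacks one (hypothetical helper).

    Non-string values (such as NaN or None) and blank strings are
    returned unchanged.
    """
    if not isinstance(url, str) or not url.strip():
        return url
    url = url.strip()
    if url.lower().startswith(("http://", "https://")):
        return url
    return f"{default_scheme}://{url}"

print(ensure_scheme("www.crunchbase.com/organization/adobe-systems"))
print(ensure_scheme("http://www.day.com"))
```

On a DataFrame this could be applied with something like `acquiring['CrunchBase Profile'].map(ensure_scheme)` before calling `ValidateLinksDF`.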

In [14]:
ValidateLinksDF(acquisitions)

Acquisition Profile
Column "Acquisition Profile" had no valid links , or is using captcha.
Try it yourself:
http://www.crunchbase.com/acquisition/5dc676a13d41c2ee87169ce59476ec2d

News Link


* Acquisition Profile is also a CrunchBase link

In [15]:
ValidateLinksDF(founders)

CrunchBase Profile
Image
