# Import

In [1]:
import pandas as pd
import numpy as np
from math import radians, cos, sin, asin, sqrt
pd.options.display.max_columns=999

# Read In Data

In [2]:
# root
folder = './2019/04/14/'

## Aircraft
Contains information about each tracked aircraft. Key fields:

- **Id**: Aircraft ID
- **AircraftModelId**: Model ID to **join `model` data**
- **Blocked**: Whether the aircraft is blocked by the FAA

In [3]:
aircraft = pd.read_csv(folder + "aircraft_20190414030057.csv")
aircraft = aircraft[["Id", "AircraftModelId", "Blocked"]]

## Model
Contains the model information for each aircraft. Key fields:

(`Crew` is not used here because it contains values as large as 48,200, which is not a plausible number of crew members for a flight.)

- **Id**: Can be used to join on **AircraftModelId** in `aircraft` table
- **NormalCruiseSpeed**: Normal cruising speed for the aircraft model
- **NormalRange**: Normal range for the aircraft model
- **NormalPassengers**: Normal passengers for the aircraft model

In [4]:
model = pd.read_csv(folder + "model_20190414030057.csv")
model = model[["Id", "NormalCruiseSpeed", "NormalRange", "NormalPassengers"]]

## Ownership
Contains the ownership information for each tracked aircraft. Key fields:

- **AircraftId**: Can be used to join on **Id** in `Aircraft` table
- **CompanyId**: Can be used to join on **Id** in `Company` table
- **OwnershipPercentage**: Ownership percentage held by the company

In [5]:
ownership = pd.read_csv(folder + "ownership_20190414030057.csv")
ownership = ownership[["AircraftId", "CompanyId", "OwnershipPercentage"]]

## Company

Contains information about tracked companies. Key fields:
- **Id**: Can be used to join on **CompanyId** in `Ownership` table
- **Symbol**: Ticker symbol if the company is publicly traded, otherwise empty
- **Industry or Sector**: Indicator of the company's industry or sector
- **Latitude & Longitude**: Geo-coordinate
- **Figi**: For mapping purpose

In [6]:
company = pd.read_csv(folder + "company_20190414030057.csv")
company = company[["Id", "Exchange", "Symbol", "Industry", "Sector", "Latitude", "Longitude", "Figi"]]

## Relationship
Contains information about the relationship between two companies. Key fields:

- **CompanyId**: Can be used to join on **Id** in `Company` table
- **RelatedCompanyId**: Can be used to join on **Id** in `Company` table
- **Type**: Type of relationship
- **StartDate**: Beginning of the date range when the relationship was active
- **EndDate**: End of the date range when the relationship was active

In [7]:
relationship = pd.read_csv(folder + "relationships_20190414030057.csv")
relationship = relationship[["CompanyId", "RelatedCompanyId", "Type", "StartDate", "EndDate"]]

## Airport
Contains information about the airports where aircraft take off or land. Key fields:

- **Icao**: The ICAO of the airport
- **Classification**: Filter out “closed” airports and encode the categories as numerical values rather than categorical ones
- **Latitude & Longitude**: Can be used to locate the airport and companies nearby
- **Country**: Can be used to indicate overseas or not

In [8]:
airport = pd.read_csv(folder + "airport_20190414030057.csv")
airport = airport[["Icao", "Classification", "Latitude", "Longitude", "Country"]]

## Flight
Contains the flight history of tracked aircraft. Key fields:

- **AircraftId**: Can be used to join on **Id** in `Aircraft` table
- **DepartureTime & ArrivalTime**: Departure and arrival time
- **DepartureIcao & ArrivalIcao**: Can be used to join on **Icao** in `airport` table
- **StayDurationSeconds**: Length of the stay in seconds. This is a crucial metric: according to the JetTrack report, a short stay (one or two hours) indicates that a group was dropped off and will be picked up later, so the stay duration can carry more information than the elapsed time alone.

In [9]:
flight = pd.read_csv(folder + "flight_20190414030057.csv", parse_dates=["DepartureTime", "ArrivalTime"])
flight = flight[["AircraftId", "DepartureTime", "DepartureIcao", "ArrivalTime", "ArrivalIcao", "StayDurationSeconds"]].sort_values("DepartureTime").reset_index(drop=True)

## Transactions
Contains information about M&A transactions in different statuses/stages. Key fields:
- **Id**: Can be used in filtering
- **CompanyId1**: Can be used to join on **Id** in `Company` table
- **CompanyId2**: Can be used to join on **Id** in `Company` table
- **Status**: Include null, Rumor, Pending, Cancelled, and Complete
- **RumorDate**: The date when the transaction is in Rumor stage
- **CancelDate**: The date when the transaction is cancelled
- **ClosingDate**: The date when the transaction is closed
- **AnnouncementDate**: The date when the transaction is in Announcement stage
- **TargetedClosingDate**: The date when the transaction is expected to be closed

In [10]:
transaction = pd.read_csv(folder + "transactions_20190414030057.csv", parse_dates=["RumorDate", "CancelDate", "ClosingDate", "AnnouncementDate", "TargetedClosingDate"])

# Data Transformation

## Features:

In [11]:
# 1. Aircraft: Change blocked values to 0 and 1
aircraft.rename(columns={"Id": "AircraftId"}, inplace=True)
# Map True/False to 1/0; any other value (e.g. missing) becomes NaN
aircraft["Blocked"] = aircraft["Blocked"].map({True: 1, False: 0})

In [12]:
# 2. Merge Aircraft with Model -> air_mod
model.rename(columns={"Id": "AircraftModelId"}, inplace=True)
air_mod = aircraft.merge(model, how="left", on=["AircraftModelId"])
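
Left joins like the one above can silently duplicate rows if the join key is not unique on the right-hand side. The following is a minimal sketch, on made-up toy data, of how pandas' `validate` and `indicator` merge options could be used to check this. The `aircraft_demo` and `model_demo` frames are hypothetical stand-ins, not the real tables.

```python
import pandas as pd

# Toy stand-ins for the `aircraft` and `model` tables (values are made up).
aircraft_demo = pd.DataFrame({
    "AircraftId": [1, 2, 3],
    "AircraftModelId": [10, 10, 99],   # 99 has no matching model
    "Blocked": [1, 0, 0],
})
model_demo = pd.DataFrame({
    "AircraftModelId": [10, 20],
    "NormalCruiseSpeed": [488.0, 450.0],
})

# validate="many_to_one" raises MergeError if AircraftModelId is duplicated
# in the right-hand table; indicator=True adds a `_merge` column that
# records whether each row found a match.
air_mod_demo = aircraft_demo.merge(
    model_demo, how="left", on="AircraftModelId",
    validate="many_to_one", indicator=True,
)
unmatched = int((air_mod_demo["_merge"] == "left_only").sum())
print(len(air_mod_demo), unmatched)
```

The row count stays equal to the left table's, and `unmatched` counts aircraft whose model ID is missing from the model table.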

In [13]:
# 3. Merge air_mod with Ownership -> air_mod_own
air_mod_own = air_mod.merge(ownership, how="left", on=["AircraftId"])

In [14]:
# 4. Clean the companies
# - Rename Id to CompanyId for future join
company.rename(columns={"Id": "CompanyId"}, inplace=True)
# - Drop companies missing Latitude or Longitude
company = company[~((company["Latitude"].isnull()) | (company["Longitude"].isnull()))]
# - Drop companies missing Industry or Sector
company = company[~((company["Sector"].isnull()) | (company["Industry"].isnull()))]
# - Drop companies without a Symbol since they are impossible to map
company = company[~company["Symbol"].isnull()]
# - Create Ticker = Exchange : Symbol so that each Ticker is unique
company["Ticker"] = company["Exchange"] + ":" + company["Symbol"]
# - Reorder the dataframe
company = company[["CompanyId", "Ticker", "Sector", "Latitude", "Longitude", "Figi"]]

In [15]:
# 5. Merge air_mod_own with company -> air_mod_own_comp
air_mod_own_comp = air_mod_own.merge(company, how="left", on="CompanyId")

In [16]:
# 6. Map company to Relationship
comp_rel = relationship.merge(company, how="left", on="CompanyId").merge(company, how="left", left_on="RelatedCompanyId", right_on="CompanyId", suffixes=["_Source", "_Target"])
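
The double merge above joins the company table twice, once per side of the relationship, and relies on `suffixes` to tell the two copies apart. A small sketch on made-up data illustrates the resulting column names. The `company_demo` and `relationship_demo` frames and their values are hypothetical.

```python
import pandas as pd

# Toy stand-ins for `company` and `relationship` (values are made up).
company_demo = pd.DataFrame({
    "CompanyId": [1, 2],
    "Ticker": ["NYS:AAA", "NAS:BBB"],
})
relationship_demo = pd.DataFrame({
    "CompanyId": [1],
    "RelatedCompanyId": [2],
    "Type": ["Supplier"],
})

comp_rel_demo = (
    relationship_demo
    # First join: attach the source company's attributes.
    .merge(company_demo, how="left", on="CompanyId")
    # Second join: attach the related company's attributes. Columns that
    # exist on both sides get the _Source / _Target suffixes.
    .merge(company_demo, how="left",
           left_on="RelatedCompanyId", right_on="CompanyId",
           suffixes=("_Source", "_Target"))
)
print(list(comp_rel_demo.columns))
```

Only columns whose names collide in the second join are suffixed, which is why `Ticker` ends up as `Ticker_Source` and `Ticker_Target`.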

In [17]:
# 7. Map Airport to Flight -> airport_flight
airport_flight = flight.merge(airport, how="left", left_on="DepartureIcao", right_on="Icao").merge(airport, how="left", left_on="ArrivalIcao", right_on="Icao", suffixes=["_Departure", "_Arrival"])
del airport_flight["DepartureIcao"]
del airport_flight["ArrivalIcao"]
# 1 if departure and arrival countries match, else 0 (missing countries never match)
airport_flight["SameCountry"] = (airport_flight["Country_Departure"] == airport_flight["Country_Arrival"]).astype(int)
del airport_flight["Country_Departure"]
del airport_flight["Country_Arrival"]

In [18]:
# 8. Use latitude and longitude with a fixed radius to find the target companies within range of the arrival airport,
#    then expand the flight records (a flight with 5 companies in range becomes 5 records) -> new_flights
RADIUS = 200
COMP_DICT = company[["Ticker", "Longitude", "Latitude"]].to_dict("records")
def comp_detct(lon_a, lat_a, lon_c, lat_c, radius):
    """
    Check whether a company is within a given radius (in miles)
    of an airport, using the haversine great-circle distance
    """
    lon_a, lat_a, lon_c, lat_c = map(np.radians, [lon_a, lat_a, lon_c, lat_c])

    # haversine formula 
    dlon = lon_c - lon_a 
    dlat = lat_c - lat_a 
    a = np.sin(dlat/2.0)**2 + np.cos(lat_a) * np.cos(lat_c) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    r = 3959 # Radius of earth in miles
    d = c * r
    
    return d <= radius
    
def master_detct(lon_a, lat_a):
    comp_list = []
    for value in COMP_DICT:
        lon_c = value["Longitude"]
        lat_c = value["Latitude"]
        ticker = value["Ticker"]
        if comp_detct(lon_a, lat_a, lon_c, lat_c, RADIUS):
            comp_list.append(ticker)            
    return comp_list

### Concerns Raised
- The algorithm above runs once for every row in `exp_airport_flight`.
- For each row, it scans all companies in `COMP_DICT` to find those within the defined range (100 miles) of the arrival airport.
- Each row takes about **232 ms**.
- Scanning all rows in `exp_airport_flight` would take 1,879,522 * 232 / 1000 / 60 / 60 / 24 ≈ **5 days**.
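
The runtime estimate can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch using the two figures quoted above. The helper name `runtime_days` is made up for illustration.

```python
# Figures quoted in the list above
N_ROWS = 1_879_522   # flight rows to scan
MS_PER_ROW = 232     # measured milliseconds per row

def runtime_days(n_rows, ms_per_row):
    """Convert a per-row cost in milliseconds into total days."""
    seconds = n_rows * ms_per_row / 1000
    return seconds / (60 * 60 * 24)

print(f"{runtime_days(N_ROWS, MS_PER_ROW):.2f} days")  # prints 5.05 days
```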

### Solutions
1. Shrink the flight data to a certain range of departure dates.
    - Not practical: there are 112,976 rows where `DepartureTime >= '2019-01-01'`, which would still take about 7 hours to process
2. **Shrink the flight data to certain tickers.**
    - Based on the transaction file, the companies with the most transactions over the whole period are:
        - Id=160 (11 transactions)
        - Id=53151 (9 transactions)
        - Id=133 (9 transactions)
        - Id=4589 (8 transactions)
    - These four companies account for 6,212 flight records in total
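
A further option, not used in this notebook, would be to vectorize the distance check so that one arrival airport is compared against all companies in a single NumPy operation instead of a Python loop. The sketch below assumes the company coordinates are available as arrays. The function name `companies_within_radius`, the tickers, and the approximate coordinates are all hypothetical.

```python
import numpy as np

EARTH_RADIUS_MILES = 3959  # same constant as in comp_detct

def companies_within_radius(lon_a, lat_a, comp_lon, comp_lat, tickers, radius):
    """Return the tickers located within `radius` miles of (lon_a, lat_a)."""
    lon_a, lat_a = np.radians(lon_a), np.radians(lat_a)
    comp_lon = np.radians(np.asarray(comp_lon, dtype=float))
    comp_lat = np.radians(np.asarray(comp_lat, dtype=float))

    # Haversine formula, evaluated for every company at once
    dlon = comp_lon - lon_a
    dlat = comp_lat - lat_a
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat_a) * np.cos(comp_lat) * np.sin(dlon / 2.0) ** 2
    d = 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

    return np.asarray(tickers)[d <= radius].tolist()

# Made-up tickers placed at approximate city-centre coordinates
tickers = ["DEMO:NYC", "DEMO:PHL", "DEMO:LAX"]
lons = [-74.006, -75.165, -118.244]
lats = [40.713, 39.953, 34.052]

# Airport located roughly at New York JFK
nearby = companies_within_radius(-73.779, 40.641, lons, lats, tickers, radius=200)
print(nearby)
```

Whether this removes the need to shrink the flight data depends on the number of companies and flights, so it would need to be timed on the real tables.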

In [19]:
# 8. Merge airport_flight with air_mod_own_comp to attach the departing (owner) company, keeping only the four companies above
airport_flight_dep_comp = airport_flight.merge(air_mod_own_comp, how="left", on="AircraftId")
airport_flight_dep_comp = airport_flight_dep_comp[airport_flight_dep_comp["CompanyId"].isin([160, 53151, 133, 4589])]
cols = ["Ticker", "Figi", "Sector", "DepartureTime", "Classification_Departure"] + \
       [c for c in airport_flight_dep_comp.columns if "Normal" in c] + ["OwnershipPercentage", "Blocked"] + \
       ["ArrivalTime", "Latitude_Arrival", "Longitude_Arrival", "Classification_Arrival"] + \
       ["SameCountry", "StayDurationSeconds"]
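# Take an explicit copy so that adding columns later does not trigger SettingWithCopyWarning
exp_airport_flight_dep_comp = airport_flight_dep_comp[cols].copy()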

In [20]:
%%time
# 9. Use Latitude + Longitude + defined radius algo to find target companies within the distance range from the arrival airport
# will roughly take 20 mins
exp_airport_flight_dep_comp["PotentialCompanies"] = exp_airport_flight_dep_comp.apply(lambda x: master_detct(x["Longitude_Arrival"], x["Latitude_Arrival"]), axis=1)

CPU times: user 21min 10s, sys: 2.34 s, total: 21min 12s
Wall time: 21min 14s


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [21]:
# Further filtering: over 800,000 negative StayDurationSeconds
exp_airport_flight_dep_comp = exp_airport_flight_dep_comp[exp_airport_flight_dep_comp["StayDurationSeconds"] > 0].reset_index(drop=True)

# Unnest PotentialCompanies for each flight record
unnest = pd.DataFrame({"PotentialCompanies":np.concatenate(exp_airport_flight_dep_comp.PotentialCompanies.values)},index=exp_airport_flight_dep_comp.index.repeat(exp_airport_flight_dep_comp.PotentialCompanies.str.len()))
new_flights = unnest.join(exp_airport_flight_dep_comp.drop(columns="PotentialCompanies"), how="left")
new_flights = new_flights[list(new_flights.columns)[1:] + ["PotentialCompanies"]]
new_flights.rename(columns={"PotentialCompanies": "TargetCompany"}, inplace=True)
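
The unnesting step turns one flight row holding a list of candidate companies into one row per candidate. As an alternative sketch, pandas' `DataFrame.explode` (available from pandas 0.25) does much the same. The toy frame `flights_demo` and its values are made up. Rows with an empty list are filtered out first because `explode` would otherwise keep them as a NaN row, whereas the repeat-based approach above drops them.

```python
import pandas as pd

# Toy stand-in for exp_airport_flight_dep_comp (values are made up).
flights_demo = pd.DataFrame({
    "Ticker": ["NYS:AAA", "NYS:AAA", "NAS:BBB"],
    "StayDurationSeconds": [3600.0, 7200.0, 1800.0],
    "PotentialCompanies": [["NYS:X", "NYS:Y"], [], ["NAS:Z"]],
})

# Keep only flights with at least one candidate company in range.
has_candidates = flights_demo["PotentialCompanies"].str.len() > 0

new_flights_demo = (
    flights_demo[has_candidates]
    .explode("PotentialCompanies")            # one row per list element
    .rename(columns={"PotentialCompanies": "TargetCompany"})
    .reset_index(drop=True)
)
print(new_flights_demo)
```

Unlike the cell above, this sketch resets the index, so the original flight row number is not preserved.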

In [22]:
# 9. Merge new_flights with company for arrival company
final_new_flights = new_flights.merge(company, how="left", left_on="TargetCompany", right_on="Ticker", suffixes=["_Source", "_Target"])
final_new_flights.drop(["TargetCompany", "CompanyId", "Latitude", "Longitude"], axis=1, inplace=True)

In [23]:
# 10. Map `Relationship` to `new_flights` -> `final_features`
cols = ["Ticker_Source", "Ticker_Target", "Type", "StartDate", "EndDate"]
rel = comp_rel[cols]
final_features = final_new_flights.merge(rel, how="left", on=["Ticker_Source", "Ticker_Target"])

### Notes
- `final_features` has 6,846,694 records.
- Only 10,422 of them (about 0.15%) have a relationship.
- Because that share is so small, the relationship data is dropped and `final_features` is set to `final_new_flights` from step 9.

In [24]:
final_features = final_new_flights.copy()

# Further cleaning
# Remove latitude and longitude of the arrival airports
final_features.drop(["Latitude_Arrival", "Longitude_Arrival"], axis=1, inplace=True)
# Remove the records where Ticker_Source == Ticker_Target
final_features = final_features.query("Ticker_Source != Ticker_Target")
# Remove the records where Ticker_Source is null
final_features = final_features[~final_features["Ticker_Source"].isnull()]
# Remove the records where Ticker_Target is null
final_features = final_features[~final_features["Ticker_Target"].isnull()].reset_index(drop=True)

## Targets:

In [25]:
# 1. Remove TargetedClosingDate since it's not meaningful in terms of defining Status
# 2. Remove CancelDate since there are no values
updated_transaction = transaction.drop(["TargetedClosingDate", "CancelDate"], axis=1)

# 3. Remove all records where Status == Cancelled since there are no CancelDate for all Cancelled records
updated_transaction = updated_transaction[updated_transaction["Status"] != "Cancelled"]

# 4. Remove all records where Status == null
updated_transaction = updated_transaction[~updated_transaction["Status"].isnull()]

# 5. Keep the records where Status == Rumor.
"""
- Reason: There might be some insight even though it is only a rumor
- Notes: 
    - Rumor records have the same `RumorDate` and `AnnouncementDate`
    - No `Rumor Cancelled` status was observed, so a `Rumor` is assumed to last forever
- Remove `Id = 6316`
"""
updated_transaction = updated_transaction[updated_transaction["Id"] != 6316]

# 6. Filter the data down to CompanyId1 in (160, 53151, 133, 4589)
updated_transaction = updated_transaction.query("CompanyId1 in [160, 53151, 133, 4589]")

# 7. Check if there are duplicated records of the combination of (CompanyId1, CompanyId2)
len(updated_transaction.groupby(["CompanyId1", "CompanyId2"]).size().reset_index().rename(columns={0: "count"}).query("count != 1")) == 0

# 8. Merge transaction data with company info
tran_comp = updated_transaction.merge(company[["CompanyId", "Ticker"]], how="left", left_on="CompanyId1", right_on="CompanyId").merge(company[["CompanyId", "Ticker"]], how="left", left_on="CompanyId2", right_on="CompanyId", suffixes=["_Source", "_Target"])
tran_comp = tran_comp[["Ticker_Source", "Ticker_Target", "Status", "RumorDate", "AnnouncementDate", "ClosingDate"]]

# 9. Filter out non-public companies
tran_comp = tran_comp[~tran_comp["Ticker_Source"].isnull()]
tran_comp = tran_comp[~tran_comp["Ticker_Target"].isnull()]

# 10. Remove records where Status == "Complete" but ClosingDate is null
is_incomplete = (tran_comp["Status"] == "Complete") & tran_comp["ClosingDate"].isna()
tran_comp = tran_comp[~is_incomplete].reset_index(drop=True)

# 11. Merge feature with targets
final_df = final_features.merge(tran_comp, how="left", on=["Ticker_Source", "Ticker_Target"])

In [26]:
def status_match(x):
    """
    Find the right status based on the arrival time
    """
    if str(x["Status"]) == "nan":
        return "Nothing"
    elif str(x["Status"]) == "Rumor":
        if x["ArrivalTime"] < x["RumorDate"]:
            return "Nothing"
        else:
            return "Rumor"
    elif str(x["Status"]) == "Pending":
        if str(x["RumorDate"]) != 'NaT' and x["ArrivalTime"] < x["RumorDate"]:
            return "Nothing"
        elif str(x["RumorDate"]) != 'NaT' and x["ArrivalTime"] >= x["RumorDate"] and x["ArrivalTime"] < x["AnnouncementDate"]:
            return "Rumor"
        elif x["ArrivalTime"] >= x["AnnouncementDate"]:
            return "Pending"
    elif str(x["Status"]) == "Complete":
        if str(x["RumorDate"]) != 'NaT' and x["ArrivalTime"] < x["RumorDate"]:
            return "Nothing"
        elif str(x["RumorDate"]) != 'NaT' and x["ArrivalTime"] >= x["RumorDate"] and x["ArrivalTime"] < x["AnnouncementDate"]:
            return "Rumor"
        elif x["ArrivalTime"] >= x["AnnouncementDate"] and x["ArrivalTime"] < x["ClosingDate"]:
            return "Pending"
        elif x["ArrivalTime"] >= x["ClosingDate"]:
            return "Complete"
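
The branching in `status_match` can also be written as a vectorized lookup, which avoids a row-wise `apply`. The sketch below uses `numpy.select` and is an assumption-laden illustration rather than a drop-in replacement. The name `status_match_vectorized`, the `"Unmatched"` default (the row-wise function returns `None` when no branch applies), and the toy dates are all hypothetical.

```python
import numpy as np
import pandas as pd

def status_match_vectorized(df):
    """Vectorized sketch of status_match; comparisons against NaT are False."""
    arr, rumor = df["ArrivalTime"], df["RumorDate"]
    ann, close = df["AnnouncementDate"], df["ClosingDate"]
    status = df["Status"]

    deal = status.isin(["Pending", "Complete"])
    before_rumor = arr < rumor                 # False when RumorDate is NaT
    in_rumor = (arr >= rumor) & (arr < ann)

    # np.select picks the first matching condition, mirroring the if/elif order.
    conditions = [
        status.isna(),
        (status == "Rumor") & before_rumor,
        status == "Rumor",
        deal & before_rumor,
        deal & in_rumor,
        (status == "Pending") & (arr >= ann),
        (status == "Complete") & (arr >= ann) & (arr < close),
        (status == "Complete") & (arr >= close),
    ]
    choices = ["Nothing", "Nothing", "Rumor", "Nothing", "Rumor",
               "Pending", "Pending", "Complete"]
    return np.select([c.to_numpy() for c in conditions], choices, default="Unmatched")

# Made-up rows: no deal, a flight before a rumor, then one deal seen at three stages
demo = pd.DataFrame({
    "Status": [np.nan, "Rumor", "Complete", "Complete", "Complete"],
    "ArrivalTime": pd.to_datetime(
        ["2018-01-01", "2018-01-01", "2018-02-15", "2018-03-15", "2018-05-01"]),
    "RumorDate": pd.to_datetime(
        [None, "2018-06-01", "2018-02-01", "2018-02-01", "2018-02-01"]),
    "AnnouncementDate": pd.to_datetime(
        [None, "2018-06-01", "2018-03-01", "2018-03-01", "2018-03-01"]),
    "ClosingDate": pd.to_datetime(
        [None, None, "2018-04-01", "2018-04-01", "2018-04-01"]),
})
labels = status_match_vectorized(demo).tolist()
print(labels)
```

Each flight is labelled with the stage the deal had reached at arrival time, so a single completed deal can yield "Rumor", "Pending", and "Complete" rows depending on when the flight landed.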

In [27]:
%%time
# 12. define the new status based on the time
final_df["NewStatus"] = final_df.apply(status_match, axis=1)
final_df.drop(["Status", "RumorDate", "AnnouncementDate", "ClosingDate"], axis=1, inplace=True)
final_df = final_df.sort_values(["DepartureTime", "Ticker_Source"]).reset_index(drop=True)

CPU times: user 1min 41s, sys: 5.09 s, total: 1min 47s
Wall time: 1min 47s


In [28]:
final_df.head()

| | Ticker_Source | Figi_Source | Sector_Source | DepartureTime | Classification_Departure | NormalCruiseSpeed | NormalRange | NormalPassengers | OwnershipPercentage | Blocked | ArrivalTime | Classification_Arrival | SameCountry | StayDurationSeconds | Ticker_Target | Sector_Target | Figi_Target | NewStatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NYS:BSX | BBG000C0LY07 | Health Technology | 2007-01-01 13:47:00 | large_airport | 488.0 | 5940.0 | 13.0 | 100.0 | 1.0 | 2007-01-01 14:15:24 | small_airport | 1 | 104676.0 | NYS:MD | Health Services | BBG000H8LJM4 | Nothing |
| 1 | NYS:BSX | BBG000C0LY07 | Health Technology | 2007-01-01 13:47:00 | large_airport | 488.0 | 5940.0 | 13.0 | 100.0 | 1.0 | 2007-01-01 14:15:24 | small_airport | 1 | 104676.0 | NYS:NEE | Utilities | BBG000BJSF01 | Nothing |
| 2 | NYS:BSX | BBG000C0LY07 | Health Technology | 2007-01-01 13:47:00 | large_airport | 488.0 | 5940.0 | 13.0 | 100.0 | 1.0 | 2007-01-01 14:15:24 | small_airport | 1 | 104676.0 | NYS:CCL | Consumer Services | BBG000BF6PR2 | Nothing |
| 3 | NYS:BSX | BBG000C0LY07 | Health Technology | 2007-01-01 13:47:00 | large_airport | 488.0 | 5940.0 | 13.0 | 100.0 | 1.0 | 2007-01-01 14:15:24 | small_airport | 1 | 104676.0 | NAS:CTXS | Technology Services | BBG000FQ9611 | Nothing |
| 4 | NYS:BSX | BBG000C0LY07 | Health Technology | 2007-01-01 13:47:00 | large_airport | 488.0 | 5940.0 | 13.0 | 100.0 | 1.0 | 2007-01-01 14:15:24 | small_airport | 1 | 104676.0 | NYS:JBL | Electronic Technology | BBG000BJNJ80 | Nothing |


## Write to CSV

In [None]:
final_df.to_csv("jettrack.csv", index=False)

# Remaining Concerns & Thoughts

1. `final_df` has 6,839,339 records with `radius` = 100 miles. Of these, only 186 records (about 0.003%) have a NewStatus (target) other than "Nothing".
2. We can combine the flight data with our merger/acquisition signal data: first spot the merger/acquisition signal, then look at the associated flight data within a relevant time range.
3. If we use longitude and latitude to detect potentially associated companies within a certain radius of the arrival airport, we can use machine learning to find the optimal radius.
4. Regarding the choice of radius, we have the following concerns. Consider two scenarios:
    - When a large company, like Amazon, flies to a large city, like New York, it is extremely hard to know what Amazon is up to there. NYC is a big city with so many possibilities that predicting the purpose of the trip becomes unreasonable. In addition, many companies have their headquarters in big cities like New York, so a single Amazon flight to NYC can produce thousands of candidate companies under our current model.
    - On the other hand, it is unusual for a large company, like Amazon, to fly to a relatively smaller city, like Austin. In that case, it is easier and more reasonable to predict what Amazon is up to in Austin. Because few companies have their headquarters in Austin, our current model produces a shorter list of target companies, and such flight information may be more valuable than that of Amazon flying to NYC.
5. After discussion with Joschi, a reasonable radius could be the distance covered by a 4-hour drive from the arrival airport: 80 miles/hour * 4 hours = 320 miles. However, 320 miles is quite large, so 200 miles seems a more reasonable radius.