# Status Mission Prediction

## 1. Problem Statement
This project aims to understand how Status Mission (the launch outcome) is affected by other variables such as Company, Country, Ownership, and further variables still to be discovered.

## 2. Data Collection
* Data Source: https://www.kaggle.com/datasets/davidroberts13/one-small-step-for-data
* Data Shape: 4324 rows X 15 columns

## 2.1 Import Data and Packages

### Import Pandas, NumPy, Matplotlib

In [69]:
import numpy as np
import pandas as pd
import csv
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn import set_config
from datetime import datetime

### Import CSV Data as Pandas DataFrame

In [2]:
df = pd.read_csv('global_space_launches.csv')

#### Display Data

In [3]:
df

Unnamed: 0,Company Name,Location,Detail,Status Rocket,Rocket,Status Mission,Country of Launch,Companys Country of Origin,Private or State Run,DateTime,Year,Month,Day,Date,Time
0,SpaceX,"LC-39A, Kennedy Space Center, Florida, USA",Falcon 9 Block 5 | Starlink V1 L9 & BlackSky,StatusActive,50.0,Success,USA,USA,P,2020-08-07 05:12:00+00:00,2020,8,7,07/08/2020,05:12
1,CASIC,"Site 9401 (SLS-2), Jiuquan Satellite Launch Ce...",Long March 2D | Gaofen-9 04 & Q-SAT,StatusActive,29.75,Success,China,China,S,2020-08-06 04:01:00+00:00,2020,8,6,06/08/2020,04:01
2,SpaceX,"Pad A, Boca Chica, Texas, USA",Starship Prototype | 150 Meter Hop,StatusActive,,Success,USA,USA,P,2020-08-04 23:57:00+00:00,2020,8,4,04/08/2020,23:57
3,Roscosmos,"Site 200/39, Baikonur Cosmodrome, Kazakhstan",Proton-M/Briz-M | Ekspress-80 & Ekspress-103,StatusActive,65.0,Success,Kazakhstan,Russia,S,2020-07-30 21:25:00+00:00,2020,7,30,30/07/2020,21:25
4,ULA,"SLC-41, Cape Canaveral AFS, Florida, USA",Atlas V 541 | Perseverance,StatusActive,145.0,Success,USA,USA,P,2020-07-30 11:50:00+00:00,2020,7,30,30/07/2020,11:50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4319,US Navy,"LC-18A, Cape Canaveral AFS, Florida, USA",Vanguard | Vanguard TV3BU,StatusRetired,,Failure,USA,USA,S,1958-02-05 07:33:00+00:00,1958,2,5,05/02/1958,07:33
4320,AMBA,"LC-26A, Cape Canaveral AFS, Florida, USA",Juno I | Explorer 1,StatusRetired,,Success,USA,USA,S,1958-02-01 03:48:00+00:00,1958,2,1,01/02/1958,03:48
4321,US Navy,"LC-18A, Cape Canaveral AFS, Florida, USA",Vanguard | Vanguard TV3,StatusRetired,,Failure,USA,USA,S,1957-12-06 16:44:00+00:00,1957,12,6,06/12/1957,16:44
4322,RVSN USSR,"Site 1/5, Baikonur Cosmodrome, Kazakhstan",Sputnik 8K71PS | Sputnik-2,StatusRetired,,Success,Kazakhstan,Russia,S,1957-11-03 02:30:00+00:00,1957,11,3,03/11/1957,02:30


## 3. Data Checks to Perform
* Columns Cleanliness & Readability
* Check Missing Values
* Check Duplicates
* Check Data Type
* Check Unique Values
* Check Statistics
* Check Categories

## 3.1 Columns Cleanliness & Readability

In [4]:
# Renaming long names
rename_dict = {'Company Name': 'Company',
               ' Rocket': 'Rocket',
               'Country of Launch': 'Launch Country', 
               'Companys Country of Origin': 'Company Origin', 
               'Private or State Run': 'Ownership'}
df.rename(columns=rename_dict, inplace=True)

# Drop redundant columns (their information is already in Year, Month, Day and Time)
df.drop(columns=['DateTime', 'Date'], inplace=True)

## 3.2 Check Missing Values

In [5]:
df.isna().sum()

Company              0
Location             0
Detail               0
Status Rocket        0
Rocket            3360
Status Mission       0
Launch Country       0
Company Origin       0
Ownership            0
Year                 0
Month                0
Day                  0
Time                 0
dtype: int64

The Rocket column is the only one with missing values: 3360 of its entries are missing.

### Handling Missing Values

In [6]:
# Add an indicator column: 1 where Rocket is missing, 0 otherwise
df['Rocket_isna'] = np.where(df['Rocket'].isna(), 1, 0)

# Convert to strings, remove thousands separators (commas), then cast back to float
df['Rocket'] = df['Rocket'].astype(str).str.replace(',', '').astype(float)
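
The two steps above can be tried on a small toy column. This is only a sketch: the `toy` frame and its values are made up for illustration, and `pd.to_numeric` is used here in place of the `astype` chain in the cell above.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values): a plain number, a comma-formatted number, a missing value
toy = pd.DataFrame({"Rocket": ["50.0", "1,160.0", None]})

# Record missingness first: 1 where the value is missing, 0 otherwise
toy["Rocket_isna"] = toy["Rocket"].isna().astype(int)

# Remove thousands separators, then convert; missing entries become NaN
toy["Rocket"] = pd.to_numeric(
    toy["Rocket"].str.replace(",", "", regex=False), errors="coerce"
)

print(toy)
```

Creating the indicator before the conversion keeps the information about which rows were originally empty, whatever value is imputed later.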

## 3.3 Check Duplicates

In [7]:
df.duplicated().sum()

1

In [8]:
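# Drop duplicates; .copy() makes df an independent DataFrame so later
# column assignments do not trigger SettingWithCopyWarning
df = df.drop_duplicates().copy()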
df.duplicated().sum()

0

## 3.4 Check Data Type

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4323 entries, 0 to 4323
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Company         4323 non-null   object 
 1   Location        4323 non-null   object 
 2   Detail          4323 non-null   object 
 3   Status Rocket   4323 non-null   object 
 4   Rocket          963 non-null    float64
 5   Status Mission  4323 non-null   object 
 6   Launch Country  4323 non-null   object 
 7   Company Origin  4323 non-null   object 
 8   Ownership       4323 non-null   object 
 9   Year            4323 non-null   int64  
 10  Month           4323 non-null   int64  
 11  Day             4323 non-null   int64  
 12  Time            4323 non-null   object 
 13  Rocket_isna     4323 non-null   int32  
dtypes: float64(1), int32(1), int64(3), object(9)
memory usage: 489.7+ KB


## 3.5 Check Unique Values

In [10]:
# Count the number of unique values in each column 
df.nunique()

Company             55
Location           137
Detail            4278
Status Rocket        2
Rocket              56
Status Mission       4
Launch Country      16
Company Origin      17
Ownership            2
Year                64
Month               12
Day                 31
Time              1273
Rocket_isna          2
dtype: int64

## 3.6 Check Statistics

In [11]:
df.describe(include='all')

Unnamed: 0,Company,Location,Detail,Status Rocket,Rocket,Status Mission,Launch Country,Company Origin,Ownership,Year,Month,Day,Time,Rocket_isna
count,4323,4323,4323,4323,963.0,4323,4323,4323,4323,4323.0,4323.0,4323.0,4323,4323.0
unique,55,137,4278,2,,4,16,17,2,,,,1273,
top,RVSN USSR,"Site 31/6, Baikonur Cosmodrome, Kazakhstan",Cosmos-3MRB (65MRB) | BOR-5 Shuttle,StatusRetired,,Success,Russia,Russia,S,,,,00:00,
freq,1777,235,6,3534,,3878,1398,2064,2930,,,,135,
mean,,,,,153.921007,,,,,1987.381911,6.753181,16.441591,,0.777238
std,,,,,288.572876,,,,,18.071932,3.416812,8.635934,,0.416148
min,,,,,5.3,,,,,1957.0,1.0,1.0,,0.0
25%,,,,,40.0,,,,,1972.0,4.0,9.0,,1.0
50%,,,,,62.0,,,,,1984.0,7.0,17.0,,1.0
75%,,,,,164.0,,,,,2002.0,10.0,24.0,,1.0


## 3.7 Check Categories

In [12]:
# Define numerical & categorical columns
initial_numerical_features = [feature for feature in df.columns if df[feature].dtype != 'object']
initial_categorical_features = [feature for feature in df.columns if df[feature].dtype == 'object']

print('There are possible {} numerical features: {}\n'.format(len(initial_numerical_features), initial_numerical_features))
print('There are possible {} categorical features: {}'.format(len(initial_categorical_features), initial_categorical_features))

There are possible 5 numerical features: ['Rocket', 'Year', 'Month', 'Day', 'Rocket_isna']

There are possible 9 categorical features: ['Company', 'Location', 'Detail', 'Status Rocket', 'Status Mission', 'Launch Country', 'Company Origin', 'Ownership', 'Time']


## 4. Handling Various Features

### Splitting Multi-Element Categorical Feature: Location

In [13]:
# Calculate frequency of each category
location_freq = (df['Location'].value_counts())

# Create a dict. mapping each location name and its frequencies
location_freq_dict = location_freq.to_dict()

location_freq_dict

{'Site 31/6, Baikonur Cosmodrome, Kazakhstan': 235,
 'Site 132/1, Plesetsk Cosmodrome, Russia': 216,
 'Site 43/4, Plesetsk Cosmodrome, Russia': 202,
 'Site 41/1, Plesetsk Cosmodrome, Russia': 198,
 'Site 1/5, Baikonur Cosmodrome, Kazakhstan': 193,
 'Site 132/2, Plesetsk Cosmodrome, Russia': 174,
 'Site 133/3, Plesetsk Cosmodrome, Russia': 158,
 'Site 43/3, Plesetsk Cosmodrome, Russia': 138,
 'LC-39A, Kennedy Space Center, Florida, USA': 120,
 'ELA-2, Guiana Space Centre, French Guiana, France': 118,
 'SLC-40, Cape Canaveral AFS, Florida, USA': 111,
 'ELA-3, Guiana Space Centre, French Guiana, France': 109,
 'SLC-41, Cape Canaveral AFS, Florida, USA': 97,
 'SLC-4W, Vandenberg AFB, California, USA': 93,
 'SLC-4E, Vandenberg AFB, California, USA': 83,
 'SLC-17A, Cape Canaveral AFS, Florida, USA': 80,
 'SLC-36B, Cape Canaveral AFS, Florida, USA': 75,
 'LA-Y1, Tanegashima Space Center, Japan': 73,
 'SLC-36A, Cape Canaveral AFS, Florida, USA': 70,
 'Site 90/20, Baikonur Cosmodrome, Kazakhsta

There may be value in extracting the information in this column into additional features.
The information appears to follow this format:

* U.S. locations have four comma-separated parts: Launch Pad, Facility, State, Country
* Most other locations have three comma-separated parts: Launch Pad, Facility, Country

Launch Pad and Facility are the parts worth keeping, because another categorical feature (Launch Country) already records the country.
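
As a rough illustration of the format described above, the sketch below splits two location strings taken from the data and picks components by position. The variable names are chosen for this example only.

```python
import pandas as pd

locations = pd.Series([
    "LC-39A, Kennedy Space Center, Florida, USA",   # four parts
    "Site 1/5, Baikonur Cosmodrome, Kazakhstan",    # three parts
])

parts = locations.str.split(", ")   # one list of components per row
launch_pad = parts.str[0]           # first component
facility = parts.str[1]             # second component
country = parts.str[-1]             # last component, whatever the number of parts

print(pd.DataFrame({"Launch Pad": launch_pad, "Facility": facility, "Country": country}))
```

Indexing from the end for the country avoids having to treat the four-part and three-part layouts separately.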

In [14]:
# Split location information into separate columns 
split_locations = df['Location'].str.split(', ', n=4, expand=True)

# Extract 'Launch Pad' and 'Facility' components
df.loc[:, 'Launch Pad'] = split_locations[0]
df.loc[:, 'Facility'] = split_locations[1]

# Drop 'Location'
df.drop(columns=['Location'], inplace=True)

df


Unnamed: 0,Company,Detail,Status Rocket,Rocket,Status Mission,Launch Country,Company Origin,Ownership,Year,Month,Day,Time,Rocket_isna,Launch Pad,Facility
0,SpaceX,Falcon 9 Block 5 | Starlink V1 L9 & BlackSky,StatusActive,50.00,Success,USA,USA,P,2020,8,7,05:12,0,LC-39A,Kennedy Space Center
1,CASIC,Long March 2D | Gaofen-9 04 & Q-SAT,StatusActive,29.75,Success,China,China,S,2020,8,6,04:01,0,Site 9401 (SLS-2),Jiuquan Satellite Launch Center
2,SpaceX,Starship Prototype | 150 Meter Hop,StatusActive,,Success,USA,USA,P,2020,8,4,23:57,1,Pad A,Boca Chica
3,Roscosmos,Proton-M/Briz-M | Ekspress-80 & Ekspress-103,StatusActive,65.00,Success,Kazakhstan,Russia,S,2020,7,30,21:25,0,Site 200/39,Baikonur Cosmodrome
4,ULA,Atlas V 541 | Perseverance,StatusActive,145.00,Success,USA,USA,P,2020,7,30,11:50,0,SLC-41,Cape Canaveral AFS
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4319,US Navy,Vanguard | Vanguard TV3BU,StatusRetired,,Failure,USA,USA,S,1958,2,5,07:33,1,LC-18A,Cape Canaveral AFS
4320,AMBA,Juno I | Explorer 1,StatusRetired,,Success,USA,USA,S,1958,2,1,03:48,1,LC-26A,Cape Canaveral AFS
4321,US Navy,Vanguard | Vanguard TV3,StatusRetired,,Failure,USA,USA,S,1957,12,6,16:44,1,LC-18A,Cape Canaveral AFS
4322,RVSN USSR,Sputnik 8K71PS | Sputnik-2,StatusRetired,,Success,Kazakhstan,Russia,S,1957,11,3,02:30,1,Site 1/5,Baikonur Cosmodrome


### Splitting Multi-Element Categorical Feature: Detail

In [15]:
df['Detail'].head(20)

0          Falcon 9 Block 5 | Starlink V1 L9 & BlackSky
1                   Long March 2D | Gaofen-9 04 & Q-SAT
2                    Starship Prototype | 150 Meter Hop
3          Proton-M/Briz-M | Ekspress-80 & Ekspress-103
4                            Atlas V 541 | Perseverance
5     Long March 4B | Ziyuan-3 03, Apocalypse-10 & N...
6                           Soyuz 2.1a | Progress MS-15
7                              Long March 5 | Tianwen-1
8                          Falcon 9 Block 5 | ANASIS-II
9                         H-IIA 202 | Hope Mars Mission
10                               Minotaur IV | NROL-129
11           Kuaizhou 11 | Jilin-1 02E, CentiSpace-1 S2
12                          Long March 3B/E | Apstar-6D
13                                   Shavit-2 | Ofek-16
14                          Long March 2D | Shiyan-6 02
15            Electron/Curie | Pics Or It Didn't Happen
16                 Long March 4B | Gaofen Duomo & BY-02
17                      Falcon 9 Block 5 | GPS I

The Detail column has a high number of unique values (4278), so frequency mapping on the raw column is not worthwhile.
From observation, the information is structured as follows:

Space Vehicle | Mission

The Space Vehicle part is the more useful one to capture, because the mission differs from launch to launch.
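
One detail matters when splitting on this separator: pandas interprets a multi-character pattern passed to `Series.str.split` as a regular expression, and in a regex `|` means "or". The sketch below, on a single row from the data, compares the unescaped pattern with an escaped one.

```python
import pandas as pd

detail = pd.Series(["Falcon 9 Block 5 | Starlink V1 L9 & BlackSky"])

# Unescaped: ' | ' is read as the regex "space OR space", so the split
# happens at the first space in the string
as_regex = detail.str.split(" | ", n=1, expand=True)

# Escaped pipe: splits on the literal ' | ' separator
as_literal = detail.str.split(r" \| ", n=1, expand=True)

print(as_regex.iloc[0, 0])     # first word only
print(as_literal.iloc[0, 0])   # full vehicle name
print(as_literal.iloc[0, 1])   # mission
```

With the escaped pattern the first column holds the full vehicle name, while the unescaped pattern keeps only its first word.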

In [16]:
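# Split Detail information into separate columns.
# Caveat: pandas treats a multi-character pattern as a regular expression, so ' | '
# means "space OR space" and the split happens at spaces. 'Space Vehicle' therefore
# keeps only the first word (e.g. 'Falcon'); use r' \| ' to split on the literal separator.
split_details = df['Detail'].str.split(' | ', n=2, expand=True)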

# Extract 'Space Vehicle' components
df.loc[:, 'Space Vehicle'] = split_details[0]

# Drop 'Detail'
df.drop(columns=['Detail'], inplace=True)

df


Unnamed: 0,Company,Status Rocket,Rocket,Status Mission,Launch Country,Company Origin,Ownership,Year,Month,Day,Time,Rocket_isna,Launch Pad,Facility,Space Vehicle
0,SpaceX,StatusActive,50.00,Success,USA,USA,P,2020,8,7,05:12,0,LC-39A,Kennedy Space Center,Falcon
1,CASIC,StatusActive,29.75,Success,China,China,S,2020,8,6,04:01,0,Site 9401 (SLS-2),Jiuquan Satellite Launch Center,Long
2,SpaceX,StatusActive,,Success,USA,USA,P,2020,8,4,23:57,1,Pad A,Boca Chica,Starship
3,Roscosmos,StatusActive,65.00,Success,Kazakhstan,Russia,S,2020,7,30,21:25,0,Site 200/39,Baikonur Cosmodrome,Proton-M/Briz-M
4,ULA,StatusActive,145.00,Success,USA,USA,P,2020,7,30,11:50,0,SLC-41,Cape Canaveral AFS,Atlas
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4319,US Navy,StatusRetired,,Failure,USA,USA,S,1958,2,5,07:33,1,LC-18A,Cape Canaveral AFS,Vanguard
4320,AMBA,StatusRetired,,Success,USA,USA,S,1958,2,1,03:48,1,LC-26A,Cape Canaveral AFS,Juno
4321,US Navy,StatusRetired,,Failure,USA,USA,S,1957,12,6,16:44,1,LC-18A,Cape Canaveral AFS,Vanguard
4322,RVSN USSR,StatusRetired,,Success,Kazakhstan,Russia,S,1957,11,3,02:30,1,Site 1/5,Baikonur Cosmodrome,Sputnik


In [17]:
df.describe(include='all')

Unnamed: 0,Company,Status Rocket,Rocket,Status Mission,Launch Country,Company Origin,Ownership,Year,Month,Day,Time,Rocket_isna,Launch Pad,Facility,Space Vehicle
count,4323,4323,963.0,4323,4323,4323,4323,4323.0,4323.0,4323.0,4323,4323.0,4323,4323,4323
unique,55,2,,4,16,17,2,,,,1273,,130,44,128
top,RVSN USSR,StatusRetired,,Success,Russia,Russia,S,,,,00:00,,Site 31/6,Plesetsk Cosmodrome,Cosmos-3M
freq,1777,3534,,3878,1398,2064,2930,,,,135,,235,1263,446
mean,,,153.921007,,,,,1987.381911,6.753181,16.441591,,0.777238,,,
std,,,288.572876,,,,,18.071932,3.416812,8.635934,,0.416148,,,
min,,,5.3,,,,,1957.0,1.0,1.0,,0.0,,,
25%,,,40.0,,,,,1972.0,4.0,9.0,,1.0,,,
50%,,,62.0,,,,,1984.0,7.0,17.0,,1.0,,,
75%,,,164.0,,,,,2002.0,10.0,24.0,,1.0,,,


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4323 entries, 0 to 4323
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Company         4323 non-null   object 
 1   Status Rocket   4323 non-null   object 
 2   Rocket          963 non-null    float64
 3   Status Mission  4323 non-null   object 
 4   Launch Country  4323 non-null   object 
 5   Company Origin  4323 non-null   object 
 6   Ownership       4323 non-null   object 
 7   Year            4323 non-null   int64  
 8   Month           4323 non-null   int64  
 9   Day             4323 non-null   int64  
 10  Time            4323 non-null   object 
 11  Rocket_isna     4323 non-null   int32  
 12  Launch Pad      4323 non-null   object 
 13  Facility        4323 non-null   object 
 14  Space Vehicle   4323 non-null   object 
dtypes: float64(1), int32(1), int64(3), object(10)
memory usage: 523.5+ KB


### Handling High Cardinality Categorical Features (4): 
### Company, Launch Pad, Facility, Space Vehicle 

In [19]:
high_cardinality_features = ['Company', 'Launch Pad', 'Facility', 'Space Vehicle']

# Step 1: Store the frequency dictionary in the high_freq_dict dictionary
high_freq_dict = {}
for feature in high_cardinality_features:
    high_freq_dict[feature] = df[feature].value_counts().to_dict()

for feature, high_freq in high_freq_dict.items():
    print(f"Frequency dictionary for {feature}:")
    print(high_freq, '\n')
    
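# Step 2: Define function to use in FunctionTransformer
#         map each category value to its frequency (0 if the value is not in the dict)
def map_features_to_freq(i, freq_dict):
    return np.vectorize(lambda x: freq_dict.get(x, 0))(i)

# Step 3: Creating pipeline
# Caveat: high_freq_dict is keyed by feature name, not by category value, so inside
# the pipeline freq_dict.get(x, 0) returns 0 for every entry. The standalone example
# in the next cell passes high_freq_dict['Company'] and maps as intended.
high_cardinality_pipeline = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                                            ('frequency_mapper', FunctionTransformer(map_features_to_freq, kw_args={'freq_dict': high_freq_dict}))])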

Frequency dictionary for Company:
{'RVSN USSR': 1777, 'Arianespace': 279, 'CASIC': 255, 'General Dynamics': 251, 'NASA': 203, 'VKS RF': 201, 'US Air Force': 161, 'ULA': 140, 'Boeing': 136, 'Martin Marietta': 114, 'SpaceX': 100, 'MHI': 84, 'Northrop': 83, 'Lockheed': 79, 'ISRO': 76, 'Roscosmos': 55, 'ILS': 46, 'Sea Launch': 36, 'ISAS': 30, 'Kosmotras': 22, 'US Navy': 17, 'Rocket Lab': 13, 'Eurockot': 13, 'ESA': 13, 'ISA': 13, 'Blue Origin': 12, 'IAI': 11, 'ExPace': 10, 'ASI': 9, 'CNES': 8, 'AMBA': 8, 'MITT': 7, 'Land Launch': 7, 'JAXA': 7, 'UT': 5, 'KCST': 5, 'Exos': 4, 'CECLES': 4, "Arme de l'Air": 4, 'SRC': 3, 'AEB': 3, 'KARI': 3, 'RAE': 2, 'OKB-586': 2, 'Yuzhmash': 2, 'Landspace': 1, 'Starsem': 1, 'Douglas': 1, 'EER': 1, 'Virgin Orbit': 1, 'IRGC': 1, 'i-Space': 1, 'OneSpace': 1, 'Sandia': 1, 'Khrunichev': 1} 

Frequency dictionary for Launch Pad:
{'Site 31/6': 235, 'Site 132/1': 216, 'Site 43/4': 202, 'Site 41/1': 198, 'Site 1/5': 193, 'Site 132/2': 174, 'Site 133/3': 158, 'Site 43/3

In [20]:
# Example: Company
example_company = df['Company']

result_company = map_features_to_freq(example_company, high_freq_dict['Company'])
print("\nResult for Company:")
print(result_company)
print(type(result_company))


Result for Company:
[ 100  255  100 ...   17 1777 1777]
<class 'numpy.ndarray'>
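
Inside a pipeline the mapping function receives all high-cardinality columns at once, so each column needs to be looked up in its own frequency dictionary. The sketch below is one possible way to do that. It assumes the transformer receives a pandas DataFrame with column names (for example, with no imputer in front of it); `map_columns_to_freq` and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def map_columns_to_freq(X, freq_dicts):
    """Replace each value with its frequency, column by column (0 if unseen)."""
    X = pd.DataFrame(X).copy()
    for col in X.columns:
        X[col] = X[col].map(freq_dicts[col]).fillna(0).astype(int)
    return X

# Toy training data (made up for illustration)
train = pd.DataFrame({
    "Company": ["SpaceX", "SpaceX", "ULA"],
    "Facility": ["Boca Chica", "Kennedy Space Center", "Kennedy Space Center"],
})

# One frequency dictionary per column, keyed by column name
freq_dicts = {col: train[col].value_counts().to_dict() for col in train.columns}

mapper = FunctionTransformer(map_columns_to_freq, kw_args={"freq_dicts": freq_dicts})
encoded = mapper.fit_transform(train)

# A category that never appeared in training maps to 0
unseen = mapper.transform(pd.DataFrame({"Company": ["Exos"], "Facility": ["Boca Chica"]}))

print(encoded)
print(unseen)
```

Building the dictionaries from the training split only, rather than from the full DataFrame, would also keep test-set information out of the encoding.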


### Handling Low Cardinality Categorical Features (2): 
### Launch Country, Company Origin 

In [21]:
low_cardinality_features = ['Launch Country','Company Origin']

# Step 1: Store the counts of the 10 most frequent categories
top10_low_freq_dict = {}
for feature in low_cardinality_features:
    top10_low_freq_dict[feature] = df[feature].value_counts().head(10).to_dict()
    
for feature, low_freq in top10_low_freq_dict.items():
    print(f"Frequency dictionary for {feature}:")
    print(low_freq, '\n')

# Step 2: Define function to use in FunctionTransformer
#         map each category to its frequency if it is in the top 10, otherwise to 0
def select_top10_freq(i, top10_dict):
    return np.vectorize(lambda x: top10_dict.get(x, 0))(i)

# Step 3: Creating pipeline
# Caveat: top10_low_freq_dict is keyed by feature name, so inside the pipeline every
# value maps to 0; OneHotEncoder(drop='first') then drops that single category and
# this step produces no output columns.
low_cardinality_pipeline = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                                           ('top10_selector', FunctionTransformer(select_top10_freq, kw_args={'top10_dict': top10_low_freq_dict})),
                                           ('onehot', OneHotEncoder(drop='first', handle_unknown='error'))])


Frequency dictionary for Launch Country:
{'Russia': 1398, 'USA': 1351, 'Kazakhstan': 701, 'France': 303, 'China': 268, 'Japan': 126, 'India': 76, 'Sea Launch': 36, 'Iran': 14, 'New Zealand': 13} 

Frequency dictionary for Company Origin:
{'Russia': 2064, 'USA': 1374, 'Multi': 339, 'China': 268, 'Japan': 126, 'India': 76, 'Isreal': 24, 'Germany': 13, 'Italy': 9, 'France': 8} 



In [22]:
# Example: Launch Country

example_launch_country = df['Launch Country']

result_launch_country = select_top10_freq(example_launch_country, top10_low_freq_dict['Launch Country'])
print("Result for Launch Country:")
print(result_launch_country)
print(type(result_launch_country))

Result for Launch Country:
[1351  268 1351 ... 1351  701  701]
<class 'numpy.ndarray'>
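
A common alternative to mapping categories to counts before one-hot encoding is to keep the most frequent categories as labels and collapse the rest into a single bucket. The sketch below shows that alternative on toy data; it is not what the notebook does, and `group_rare`, `top_n` and the `'Other'` label are hypothetical choices for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy column (made up for illustration)
train = pd.Series(["Russia", "Russia", "USA", "USA", "Iran"], name="Launch Country")

top_n = 2  # hypothetical cutoff; the notebook keeps the top 10
keep = train.value_counts().head(top_n).index

def group_rare(s, keep):
    """Keep the most frequent categories and collapse all others into 'Other'."""
    return s.where(s.isin(keep), "Other")

grouped = group_rare(train, keep)

# One column per remaining category; unknown categories at predict time encode as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
onehot = encoder.fit_transform(grouped.to_frame()).toarray()

print(grouped.tolist())
print(onehot.shape)
```

Keeping the labels rather than the counts means two countries with equal counts still get separate columns.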


### Handling Categorical Time Feature (1): 
### Time 

### Handling Binary Features (3): 
### Status Rocket, Ownership, Rocket_isna 

In [25]:
binary_features = ['Status Rocket', 'Ownership', 'Rocket_isna']

# Step 1: Store the binary dictionary 
binary_dict = {}
for feature in binary_features:
    if feature == 'Status Rocket':
        binary_dict[feature] = {'StatusActive': 1, 'StatusRetired': 0}
    elif feature == 'Ownership':
        binary_dict[feature] = {'S': 1, 'P': 0}
    elif feature == 'Rocket_isna':
        binary_dict[feature] = {1: 1, 0: 0}


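# Step 2: Define function to use in FunctionTransformer
#         map each value to its binary code (0 if the value is not in the dict)
def map_binary(i, bin_dict):
    return np.vectorize(lambda x: bin_dict.get(x, 0))(i)

# Step 3: Creating pipeline
# Caveat: binary_dict is keyed by feature name, so inside the pipeline every value
# maps to 0. The standalone example in the next cell passes
# binary_dict['Status Rocket'] and maps as intended.
binary_cardinality_pipeline = Pipeline(steps=[('binary_encode', FunctionTransformer(map_binary, kw_args={'bin_dict': binary_dict}))])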

In [26]:
# Example: Status Rocket
example_status_rocket = df['Status Rocket']

result_status_rocket = map_binary(example_status_rocket, binary_dict['Status Rocket'])
print("Result for Status Rocket:")
print(result_status_rocket)
print(type(result_status_rocket))

Result for Status Rocket:
[1 1 1 ... 0 0 0]
<class 'numpy.ndarray'>


In [27]:
categorical_features = low_cardinality_features + high_cardinality_features + binary_features
categorical_features

['Launch Country',
 'Company Origin',
 'Company',
 'Launch Pad',
 'Facility',
 'Space Vehicle',
 'Status Rocket',
 'Ownership',
 'Rocket_isna']

### Handling Categorical Target Feature (1): 
### Status Mission 

In [28]:
# Collapse the four outcome classes into a binary target: True for 'Success', False otherwise
df.loc[:, 'Status Mission'] = (df['Status Mission'] == 'Success')

### Handling Numerical Features (4):
### Rocket, Year, Month, Day

In [29]:
numerical_features =  ['Rocket', 'Year', 'Month', 'Day']

imp_strat = 'constant'
numerical_pipeline =  Pipeline([('imputer', SimpleImputer(strategy=imp_strat, fill_value=0)),  
                                ('scaler', StandardScaler())])

In [30]:
# Example: Rocket

example_rocket = df['Rocket']
print(type(example_rocket[1]))

<class 'numpy.float64'>


## 5. Preprocessor

In [31]:
# Time is left out for now; its pipeline would be added to the transformers as:
# ('time', time_cardinality_pipeline, ['Time'])

categorical_preprocessor = ColumnTransformer(
    transformers=[('high_cardinality', high_cardinality_pipeline, high_cardinality_features),
                  ('low_cardinality', low_cardinality_pipeline, low_cardinality_features),
                  ('binary_cardinality', binary_cardinality_pipeline, binary_features)])

numerical_preprocessor = ColumnTransformer(
    transformers=[('numerical_scaling', numerical_pipeline, numerical_features)])

preprocessor = ColumnTransformer(
    transformers=[('numerical', numerical_preprocessor, numerical_features),
                  ('categorical', categorical_preprocessor, categorical_features)])

In [32]:
X = df[categorical_features+numerical_features]
y = df['Status Mission'].astype(int)

In [33]:
# Splitting the dataset into Test|Train : 30|70 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21)

In [34]:
final_pipe = make_pipeline(preprocessor, LogisticRegression())
set_config(display='diagram')
display(final_pipe)

In [35]:
# Fit the preprocessing step of the final pipeline on the training data and transform it
X_train_processed = final_pipe.named_steps['columntransformer'].fit_transform(X_train, y_train)

In [36]:
# Transform the test data using the preprocessor
X_test_processed = final_pipe.named_steps['columntransformer'].transform(X_test)

## 6. Model

In [37]:
models = {'Logistic Regression': LogisticRegression()}

for name, model in models.items():
    model.fit(X_train_processed, y_train)
    y_train_prediction = model.predict(X_train_processed)
    y_test_prediction = model.predict(X_test_processed)
    
    train_accuracy = model.score(X_train_processed, y_train)
    test_accuracy = model.score(X_test_processed, y_test)
    
    print(f'{name} Trained Accuracy: {train_accuracy*100: .2f}%')
    print(f'{name} Tested Accuracy: {test_accuracy*100: .2f}%')
    
    error_train = y_train_prediction - y_train
    error_test = y_test_prediction - y_test
    
    # Calculate the number of occurrences of 0 error
    zero_error_train_count = (error_train == 0).sum()
    zero_error_test_count = (error_test == 0).sum()

    zero_error_train_percent = (zero_error_train_count/len(error_train))*100
    zero_error_test_percent = (zero_error_test_count/len(error_test))*100
    print('\nManual no score method')
    print(f'{name} Trained Score: {zero_error_train_percent: .2f}%')
    print(f'{name} Tested Score: {zero_error_test_percent: .2f}%')
    

    con_matrix = confusion_matrix(y_test, y_test_prediction)
    true_negative = con_matrix[0,0]
    false_positive = con_matrix[0,1]
    false_negative = con_matrix[1,0]
    true_positive = con_matrix[1,1]
    print('')
    print('True Negative:', true_negative)
    print('False Positive:', false_positive)
    print('False Negative:', false_negative)
    print('True Positive:', true_positive)
    
    
    report = classification_report(y_test, y_test_prediction)
    print('')
    print(report)
    # Split the report into lines and extract the line containing metrics for the desired class
    lines = report.split('\n')
    target_class_line = lines[3]  # Assuming the desired class is at index 3

    # Split the line by whitespace and extract precision, recall, and F1-score
    precision, recall, f1_score, support = target_class_line.split()[1:]

    print("Precision:", precision)
    print("Recall:", recall)
    print("F1-score:", f1_score)

Logistic Regression Trained Accuracy:  89.76%
Logistic Regression Tested Accuracy:  89.59%

Manual no score method
Logistic Regression Trained Score:  89.76%
Logistic Regression Tested Score:  89.59%

True Negative: 0
False Positive: 135
False Negative: 0
True Positive: 1162

              precision    recall  f1-score   support

           0       0.00      0.00      0.00       135
           1       0.90      1.00      0.95      1162

    accuracy                           0.90      1297
   macro avg       0.45      0.50      0.47      1297
weighted avg       0.80      0.90      0.85      1297

Precision: 0.90
Recall: 1.00
F1-score: 0.95
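
Parsing the text report by line index depends on the report layout. As a sketch of an alternative, `classification_report` can return a dictionary via `output_dict=True`, which allows the metrics of a class to be read by key. The labels below are toy values chosen to mimic a model that always predicts class 1.

```python
from sklearn.metrics import classification_report

# Toy labels (made up): one negative, three positives, all predicted as positive
y_true = [0, 1, 1, 1]
y_pred = [1, 1, 1, 1]

# zero_division=0 reports 0.0 (without a warning) for a class that is never predicted
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)

positive = report["1"]  # keys are the class labels as strings
precision = positive["precision"]
recall = positive["recall"]
f1 = positive["f1-score"]

print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
```

The same dictionary also exposes the per-class results for class 0, which is where an always-positive model shows its weakness.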




In [47]:
X_train.shape

(3026, 13)

In [46]:
X_train_processed.shape

(3026, 11)

In [40]:
# Fails: the preprocessor outputs 11 columns, but 13 column names are supplied here
df_train_processed = pd.DataFrame(X_train_processed, columns=numerical_features + categorical_features)
df_train_processed

ValueError: Shape of passed values is (3026, 11), indices imply (3026, 13)

In [49]:
model = LogisticRegression()

In [90]:
X=df[['Year']]
type(X)

pandas.core.frame.DataFrame

In [91]:
y=df['Status Mission'].astype(int)
type(y)

pandas.core.series.Series

In [55]:
model.fit(X, y)

In [56]:
model.score(X, y)

0.8970622253065001

In [66]:
y_pred = model.predict(X)
y_pred

array([1, 1, 1, ..., 1, 1, 1])

In [None]:
y

In [61]:
error = y_pred - y
error

0       0
1       0
2       0
3       0
4       0
       ..
4319    1
4320    0
4321    1
4322    0
4323    0
Name: Status Mission, Length: 4323, dtype: int32

In [70]:
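# Note: scikit-learn's signature is confusion_matrix(y_true, y_pred); with the
# arguments in this order, rows are predictions and columns are true labels
confusion_matrix(y_pred, y)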

array([[   0,    0],
       [ 445, 3878]], dtype=int64)

In [74]:
sum(error==-1)/len(error)

0.0

In [76]:
sum(y_pred == y)/len(error)

0.8970622253065001

In [77]:
y_pred

array([1, 1, 1, ..., 1, 1, 1])

In [78]:
sum(y_pred)

4323
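
The outputs above show that every launch is predicted as class 1, so the accuracy equals the share of successes in the data. A quick way to make that reference point explicit is a majority-class baseline. The sketch below rebuilds labels with the class counts from the confusion matrix above (445 and 3878) and uses scikit-learn's `DummyClassifier`; the placeholder feature matrix is an assumption, since the baseline ignores features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix

# Labels with the class counts reported above: 445 non-successes, 3878 successes
y = np.array([0] * 445 + [1] * 3878)
X = np.zeros((len(y), 1))  # placeholder features; the baseline does not use them

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = baseline.score(X, y)

# Rows are true labels, columns are predictions
cm = confusion_matrix(y, baseline.predict(X))

print(baseline_accuracy)
print(cm)
```

A model is only informative here if it beats this baseline, so per-class recall is a more telling metric than accuracy on data this imbalanced.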

In [87]:
df[~df['Status Mission'].astype(bool)].sample(100)

Unnamed: 0,Company,Status Rocket,Rocket,Status Mission,Launch Country,Company Origin,Ownership,Year,Month,Day,Time,Rocket_isna,Launch Pad,Facility,Space Vehicle
629,SpaceX,StatusRetired,59.5,False,USA,USA,P,2012,10,8,00:35,0,SLC-40,Cape Canaveral AFS,Falcon
91,Exos,StatusActive,,False,USA,USA,P,2019,10,26,17:40,1,Vertical Launch Area,Spaceport America,SARGE
4259,US Air Force,StatusRetired,,False,USA,USA,S,1960,6,29,22:00,1,SLC-1W (75-3-4),Vandenberg AFB,Thor-DM18
3499,RVSN USSR,StatusRetired,,False,Russia,Russia,S,1969,12,27,14:20,1,Site 132/1,Plesetsk Cosmodrome,Cosmos-3M
3671,RVSN USSR,StatusRetired,,False,Kazakhstan,Russia,S,1968,4,24,16:00,1,Site 90/19,Baikonur Cosmodrome,Tsyklon-2A
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1145,ISRO,StatusRetired,47.0,False,India,India,S,2001,4,18,10:13,0,First Launch Pad,Satish Dhawan Space Centre,GSLV
3152,RVSN USSR,StatusRetired,,False,Kazakhstan,Russia,S,1973,4,25,09:10,1,Site 90/19,Baikonur Cosmodrome,Tsyklon-2
1680,General Dynamics,StatusRetired,,False,USA,USA,P,1992,8,22,22:40,1,SLC-36B,Cape Canaveral AFS,Atlas
4102,General Dynamics,StatusRetired,,False,USA,USA,P,1962,12,17,20:36,1,SLC-3E,Vandenberg AFB,Atlas-LV3


In [88]:
df[df['Status Mission'].astype(bool)].sample(100)

Unnamed: 0,Company,Status Rocket,Rocket,Status Mission,Launch Country,Company Origin,Ownership,Year,Month,Day,Time,Rocket_isna,Launch Pad,Facility,Space Vehicle
3619,RVSN USSR,StatusRetired,,True,Kazakhstan,Russia,S,1968,11,1,00:27,1,Site 90/20,Baikonur Cosmodrome,Tsyklon-2A
2653,RVSN USSR,StatusRetired,,True,Russia,Russia,S,1977,10,21,10:05,1,Site 132/1,Plesetsk Cosmodrome,Cosmos-3M
1495,VKS RF,StatusRetired,,True,Russia,Russia,S,1995,8,9,01:21,1,Site 43/3,Plesetsk Cosmodrome,Molniya-M
2028,RVSN USSR,StatusRetired,,True,Russia,Russia,S,1986,12,10,07:30,1,Site 32/2,Plesetsk Cosmodrome,Tsyklon-3
2451,ISRO,StatusRetired,,True,India,India,S,1980,7,18,02:33,1,SLV LP,Satish Dhawan Space Centre,SLV-3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3852,General Dynamics,StatusRetired,,True,USA,USA,P,1966,8,16,18:30,1,SLC-4E,Vandenberg AFB,Atlas-SLV3
2871,RVSN USSR,StatusRetired,,True,Russia,Russia,S,1975,12,3,10:00,1,Site 43/3,Plesetsk Cosmodrome,Voskhod
3527,RVSN USSR,StatusRetired,,True,Russia,Russia,S,1969,9,24,12:15,1,Site 41/1,Plesetsk Cosmodrome,Voskhod
707,Northrop,StatusActive,46.0,True,USA,USA,P,2010,11,20,01:25,0,LP-1,Pacific Spaceport Complex,Minotaur
