# Aeronautics & Astronautics Abstracts

Can a machine distinguish between propulsion and thermophysics abstracts? <br> 
https://www.kaggle.com/sylar68/aeronautics-astronautics-journal-abstracts

### Content
The Aeronautics & Astronautics Abstracts dataset contains 493 records, each holding the title and abstract of a paper published by AIAA in either the Journal of Propulsion and Power (JPP) or the Journal of Thermophysics and Heat Transfer (JTHT). The records were retrieved manually from https://arc.aiaa.org. The task is to build a classifier that can tell which of the two technical domains an abstract and/or title belongs to. The challenge is that the two domains (propulsion and heat transfer) share much of their vocabulary (combustion, exchange, thermal, fluid, etc.), which makes it harder to tell which journal a text comes from.
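To make the task concrete, here is a minimal sketch of a bag-of-words naive Bayes classifier written with the standard library only. It is an illustration, not the method used in this notebook: the `tokenize`, `train` and `predict` helpers and the four toy sentences are hypothetical, and a real experiment would train on the actual titles and abstracts.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only (a simple stand-in tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def train(samples):
    """Collect per-label token counts from an iterable of (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> token counts
    doc_counts = Counter()              # label -> number of documents
    vocab = set()
    for text, label in samples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def predict(text, model):
    """Return the label with the highest log-probability under naive Bayes."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore tokens never seen during training
            # Laplace (add-one) smoothing keeps unseen label/token pairs non-zero
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy samples, not taken from the dataset
toy_samples = [
    ("rocket nozzle thrust combustion chamber", "JPP"),
    ("turbine engine thrust performance", "JPP"),
    ("heat transfer natural convection enclosure", "JTHT"),
    ("thermal conductivity radiation heat flux", "JTHT"),
]
toy_model = train(toy_samples)
print(predict("combustion in a rocket engine", toy_model))
print(predict("convection heat flux in an enclosure", toy_model))
```

Words that occur in both journals contribute almost equally to both scores, so the decision rests on the journal-specific vocabulary. That is why the vocabulary comparison later in this notebook matters.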



### Import the required libraries

In [1]:
import os
import numpy as np
import pandas as pd
from pandas import DataFrame

# a library for reading data and formatting information from Excel files
import xlrd

# libraries for text processing
import nltk
nltk.download('punkt')

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

[nltk_data] Downloading package punkt to /Users/elena/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Read data from Excel

We downloaded the aiaa_dataset.xls dataset and stored it in the INPUT_DATA_DIR as defined below.

In [42]:
# define some information about where to get our data
INPUT_DATA_DIR = os.environ.get('INPUT_DATA_DIR', '/workspace/data/')
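Because the data location comes from an environment variable with a fallback, it can help to check the resolved path before reading the file. The sketch below assumes the same `INPUT_DATA_DIR` variable and default used in this notebook. The `resolve_dataset` helper is hypothetical.

```python
import os
from pathlib import Path


def resolve_dataset(relative_path, env_var="INPUT_DATA_DIR", default="/workspace/data/"):
    """Join the configured data directory with a path relative to it."""
    base = Path(os.environ.get(env_var, default))
    return base / relative_path


xls_path = resolve_dataset("aero-astro-abstracts/aiaa_dataset.xls")
if not xls_path.is_file():
    # Report the missing file here instead of failing later inside the Excel reader
    print(f"Dataset not found at {xls_path}; set INPUT_DATA_DIR to the folder that contains it.")
```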

In [45]:
%%time
raw_df = pd.read_excel(os.path.join(INPUT_DATA_DIR, 'aero-astro-abstracts/aiaa_dataset.xls')) 
raw_df.head()

CPU times: user 27.1 ms, sys: 3.85 ms, total: 30.9 ms
Wall time: 41.3 ms


| | title | abstract | journal | volume |
|---|---|---|---|---|
| 0 | Timescale-Based Frozen Nonadiabatic Flamelet C... | The present research work introduces a novel c... | JPP | 37.4 |
| 1 | Development and Testing of Liquid Simulants | A group of liquid simulants was developed in o... | JPP | 37.4 |
| 2 | Conjugate Analysis of Silica-Phenolic Charring... | Because of its excellent insulation capability... | JPP | 37.4 |
| 3 | Theoretical Analysis of Performance Parameters... | Conventional expressions and definitions descr... | JPP | 37.4 |
| 4 | Hurst Exponents for Intra- and Intercycle Ther... | The detrended fluctuation analysis, a techniqu... | JPP | 37.4 |


### Explore dataset properties

In [4]:
raw_df.shape

(493, 4)

In [5]:
# check out the columns with nulls
raw_df.isnull().any()

title       False
abstract    False
journal     False
volume      False
dtype: bool

In [54]:
# drop fully duplicated rows (duplicated() compares every column);
# to inspect them first, display raw_df[raw_df.duplicated()] in its own cell
raw_df = raw_df.drop_duplicates()
raw_df.shape

(492, 4)

In [7]:
# check unique values in columns
print(raw_df.nunique())
    
a = raw_df['volume'].unique()
a.sort()
print(f'\nunique values in the volume column: {a}')

b = raw_df['journal'].unique()
print(f'\nunique values in the journal column: {b}')

title       474
abstract    474
journal       2
volume       25
dtype: int64

unique values in the volume column: [33.1 33.2 33.3 33.4 34.1 34.2 34.3 34.4 34.5 35.1 35.2 35.3 35.4 35.5
 35.6 36.1 36.2 36.3 36.4 36.5 36.6 37.1 37.2 37.3 37.4]

unique values in the journal column: ['JPP' 'JTHT']
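The counts above show 474 unique titles in 492 rows, so some titles still repeat after the full-row deduplication. Those rows must differ in at least one other column. The sketch below uses a hypothetical toy frame (not rows from the dataset) to show how the `subset` and `keep` parameters of pandas `duplicated` and `drop_duplicates` can surface such repeats. Whether they should be dropped depends on why they occur.

```python
import pandas as pd

# Hypothetical toy frame: the same paper listed under two volumes
toy_df = pd.DataFrame({
    "title": ["Paper A", "Paper A", "Paper B"],
    "abstract": ["Text A", "Text A", "Text B"],
    "journal": ["JPP", "JPP", "JTHT"],
    "volume": [36.1, 36.2, 35.4],
})

# Full-row comparison finds nothing, because the volume differs
print(toy_df.duplicated().sum())

# Restricting the comparison to title and abstract flags every copy of the repeat
repeat_mask = toy_df.duplicated(subset=["title", "abstract"], keep=False)
print(toy_df[repeat_mask])

# Keep the first occurrence of each (title, abstract) pair
deduped = toy_df.drop_duplicates(subset=["title", "abstract"], keep="first")
print(deduped.shape)
```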


### Word embedding

In [2]:
#df = pd.read_csv('glove.840B.300d.txt', sep=" ", quoting=3, header=None, index_col=0)
#glove_model = {key: val.values for key, val in df.T.items()}
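The commented-out cell would load the whole GloVe file into a DataFrame before converting it to a dictionary. As an alternative, the sketch below reads the text format line by line. It assumes the usual GloVe layout, one token followed by space-separated floats per line. The `load_glove` helper and the three-dimensional sample are hypothetical. Taking the last `dim` fields as the vector is a defensive choice so that a token containing spaces would still parse.

```python
import io


def load_glove(handle, dim):
    """Parse GloVe text format: one token per line followed by `dim` floats."""
    vectors = {}
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        if len(parts) <= dim:
            continue  # skip empty or malformed lines
        # The last `dim` fields are the vector; everything before them is the token
        word = " ".join(parts[:-dim])
        vectors[word] = [float(x) for x in parts[-dim:]]
    return vectors


# Tiny in-memory sample standing in for the real file
sample = io.StringIO("thermal 0.1 0.2 0.3\nthrust -0.4 0.5 0.6\n")
glove_model = load_glove(sample, dim=3)
print(glove_model["thrust"])
```

With the real file, the handle would come from `open('glove.840B.300d.txt', encoding='utf-8')` and `dim` would be 300.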

# Retrieve vocabulary from text data
We will explore the vocabulary of the text from each journal.

### Retrieve combined text as well as text per each journal

In [31]:
# combine title and abstract into one string per paper
combined = raw_df['title'] + ' ' + raw_df['abstract']

text_data = combined.tolist()
text_jpp = combined[raw_df['journal'] == 'JPP'].tolist()
text_jtht = combined[raw_df['journal'] == 'JTHT'].tolist()

### Get vocabulary using Natural Language Toolkit

In [36]:
tokens_all = sorted(set(nltk.word_tokenize(' '.join(text_data).lower())))

tokens_jpp = sorted(set(nltk.word_tokenize(' '.join(text_jpp).lower())))
tokens_jtht = sorted(set(nltk.word_tokenize(' '.join(text_jtht).lower())))

print(len(tokens_all))
print(len(tokens_jpp))
print(len(tokens_jtht))


8283
5564
5317


In [56]:
# set operations avoid the quadratic cost of list membership tests
set_jpp, set_jtht = set(tokens_jpp), set(tokens_jtht)

tokens_common = sorted(set_jpp & set_jtht)
tokens_jpp_unique = sorted(set_jpp - set_jtht)
tokens_jtht_unique = sorted(set_jtht - set_jpp)

print(f'Number of common tokens: {len(tokens_common)}')
print(f'Number of JPP specific tokens: {len(tokens_jpp_unique)}')
print(f'Number of JTHT specific tokens:  {len(tokens_jtht_unique)}')

Number of common tokens: 2598
Number of JPP specific tokens: 2966
Number of JTHT specific tokens:  2719
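The overlap can also be summarized as a single number, the Jaccard index: shared tokens divided by all distinct tokens. With the counts reported above this is 2598 / 8283 ≈ 0.31. The sketch below computes it on hypothetical toy token lists. The `vocabulary_overlap` helper is an illustration and not part of the notebook.

```python
def vocabulary_overlap(tokens_a, tokens_b):
    """Return shared tokens, tokens specific to each side, and the Jaccard index."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    common = set_a & set_b
    union = set_a | set_b
    return {
        "common": sorted(common),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
        "jaccard": len(common) / len(union) if union else 0.0,
    }


# Hypothetical toy vocabularies
toy_jpp = ["combustion", "nozzle", "thermal", "thrust"]
toy_jtht = ["combustion", "conduction", "radiation", "thermal"]

result = vocabulary_overlap(toy_jpp, toy_jtht)
print(result["common"])
print(result["jaccard"])  # 2 shared tokens out of 6 distinct tokens
```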


### Common vocabulary

In [39]:
print(tokens_common)

['%', '(', ')', '*', ',', '.', '0', '0.04', '0.1', '0.2', '0.22', '0.25', '0.3', '0.4', '0.5', '0.6', '0.7', '0.75', '0.8', '0.8.', '1', '1.0', '1.3', '1.4', '1.5', '1.6', '1.8', '10', '10,000', '100', '1000.', '11', '12', '120', '13', '15', '150', '160', '180', '19', '2', '2-d', '2.1', '2.2', '2.5', '20', '200', '2000', '21', '24', '25', '250', '26', '27', '3', '3-d', '3.1', '3.5', '30', '300', '3000', '33', '35', '36', '3d', '4', '4.', '40', '43', '450', '5', '50', '500', '6', '60', '600', '7', '70', '72', '75', '8', '80', '800', '87.5', '9', '90', '900', ':', ';', '<', '>', '[', ']', 'a', 'ab', 'ability', 'ablated', 'ablating', 'ablation', 'ablative', 'ablator', 'able', 'about', 'above', 'absolute', 'absorbed', 'absorbing', 'accelerated', 'acceleration', 'access', 'accompanied', 'accomplished', 'according', 'account', 'accounts', 'accretion', 'accumulation', 'accuracy', 'accurate', 'accurately', 'achieve', 'achieved', 'achieves', 'acid', 'acquired', 'across', 'act', 'activated', 'ac
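The common vocabulary starts with punctuation and bare numbers, which say little about the technical domain. A possible filtering step, sketched below, keeps only tokens made of letters, optionally joined by hyphens. The `keep_words` helper and its regular expression are assumptions and not part of the notebook. The sample tokens are copied from the output above.

```python
import re

# Letters only, optionally joined by single hyphens (e.g. "silica-phenolic")
WORD_RE = re.compile(r"[a-z]+(?:-[a-z]+)*")


def keep_words(tokens):
    """Drop punctuation, numbers and mixed alphanumeric tokens."""
    return [t for t in tokens if WORD_RE.fullmatch(t)]


sample_tokens = ['%', '(', '0.04', '10,000', '2-d', '3d', 'ablation', 'ablative', 'accuracy']
print(keep_words(sample_tokens))
```

This rule also discards tokens such as `2-d` and `3d`. Whether those carry useful signal is a judgment call, so the pattern may need adjusting.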

### JPP vocabulary

In [40]:
print(tokens_jpp_unique)



### JTHT vocabulary

In [41]:
print(tokens_jtht_unique)

["''", '-diameter', '/water', '0.0001', '0.0001–0.01', '0.0005', '0.00056', '0.0008575', '0.001', '0.001.', '0.00109', '0.00572', '0.006', '0.008', '0.01', '0.015', '0.01°c', '0.01–1', '0.01≤𝑅𝑖≤10', '0.020', '0.023°c/w', '0.02°c', '0.041.', '0.0416', '0.05', '0.0625', '0.08', '0.081', '0.083', '0.106', '0.12', '0.15', '0.166', '0.1–0.6', '0.1≤𝐴≤0.6', '0.1≤𝐴𝑅≤0.4', '0.1≤𝑅𝑐≤0.4', '0.208kg/s', '0.228', '0.25≤𝐿𝑇≤1', '0.31', '0.33', '0.38', '0.385', '0.3–0.7m/s', '0.4–1', '0.54', '0.6–1.4', '0.6–1.4m/s', '0.71.', '0.8877', '0.89', '0.917.', '0deg≤𝛾≤90deg', '0°c', '0–0.12', '0–20', '0–90', '0≤𝐻𝑎≤45', '0≤𝜀≤1', '0≤𝜙≤0.05', '1,225,616', '1.', '1.03±0.26', '1.1–5.9', '1.2', '1.23', '1.25', '1.5.', '1.64.', '1.79', '1.93', '1.95', '10-state', '10.1', '10.3', '10.4', '10.50', '10.6', '1000', '100mw⋅m−2', '100°c', '1010', '103', '103–106', '103≤𝑅𝑎≤106', '104', '104≤𝑅𝑎≤106', '105pa', '105–106', '106', '107', '10km/s', '10°', '10°c', '10°≤𝛾≤350°', '10–15and25nm', '10−10', '10−104', '10−3', '10−3≤𝐷𝑎≤1