## East Airline Analysis

## 1 Introduction
### 1.1 What is data
Data is a collection of plain facts gathered within a defined context. Statisticians describe it as a set of observations held in variables (columns), each with its own characteristics. Once summarised, data becomes information, and once analysed, it can serve as evidence for or against a hypothesis. To get from raw data to information and ultimately to evidence, data must go through a process. Data analysis is that process: raw data is ordered and organised so that it can be used in methods that help to explain the past and predict the future (Hector, 2013).

Data analytics has no single predefined methodology, because each dataset and context is different. In general, though, the process of turning raw data into actionable insight can be expected to include exploratory data analysis (EDA), descriptive analysis, statistical analysis, diagnostic analysis, predictive analysis and prescriptive analysis. Regardless of the domain, most analytical projects share an ultimate goal: to derive actionable insights from a dataset and communicate them so that decision makers, such as stakeholders, can make informed decisions about the context in which the data was gathered. Most of the time, these decision makers are not data experts. As data professionals we should be well informed about, and interested in, the subtle details of analytical methodology, but there is a clear gap between the information we seek to communicate and the ability of its audience to understand it, usually because they lack the technical background these methodologies assume. This is where data visualisation comes into play. It is reasonable to assume that presenting trends and insights in pictorial form will be more informative than presenting raw numbers or text. Ultimately, a good data analytics implementation should give a clear picture of where an organisation is, where it has been, and where it is heading (UCDPA, 2024).

### 1.2 Machine learning 

Machine learning (ML) is a branch of artificial intelligence (AI) that focuses on building computer systems that learn from data; ML algorithms are trained to find relationships and patterns in that data (Linda, 2021). Although different ML algorithms work in different ways, a common goal is to learn the patterns in the data currently available and use them to make predictions about new or future data from the same context. The quality and reliability of those predictions depend on the quality of the data used to train the algorithm.

### Branches of Machine learning 

1. Supervised machine learning
2. Unsupervised machine learning 
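
To make the distinction between the two branches concrete, here is a minimal sketch using only NumPy. The data is synthetic and every variable name is an assumption made for illustration; it is not the method used later in this notebook. The supervised part fits a line to inputs that come with known target values, while the unsupervised part groups unlabelled points with a few k-means style updates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Supervised: inputs x come with known targets y, and the model learns the mapping.
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(0, 0.1, size=x.size)   # synthetic labelled data
slope, intercept = np.polyfit(x, y, deg=1)             # fit y ~ slope * x + intercept

# Unsupervised: only inputs, no targets; the algorithm groups similar points.
points = np.concatenate([rng.normal(0, 0.5, 30), rng.normal(10, 0.5, 30)])
centres = np.array([points.min(), points.max()])       # simple initial guesses
for _ in range(10):                                    # a few k-means style updates
    # assign each point to its nearest centre, then move each centre to its group mean
    labels = np.abs(points[:, None] - centres[None, :]).argmin(axis=1)
    centres = np.array([points[labels == k].mean() for k in range(2)])

print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")
print("cluster centres:", np.round(centres, 2))
```

In the supervised case the quality of the fit can be checked against the known targets; in the unsupervised case there is no target, so the groups have to be judged by how well they separate the data.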

### About data
The file EastWestAirlines contains information on passengers who belong to an airline’s frequent flier program.
For each passenger the data include information on their mileage history and on different ways they accrued or
spent miles in the last year. The goal is to try to identify clusters of passengers that have similar characteristics
for the purpose of targeting different segments for different types of mileage offers.
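
Clustering passengers by similarity usually relies on distances between records, so features measured on very different scales (miles in the tens of thousands, flight counts in single digits) are often standardised first. The sketch below shows z-score scaling with pandas on a small made-up frame; the column names and values are hypothetical and are not taken from the EastWestAirlines file.

```python
import pandas as pd

# Hypothetical passenger records; column names and values are illustrative only.
passengers = pd.DataFrame({
    "miles_balance": [28000, 19000, 41000, 97000, 15000],
    "flights_last_year": [0, 1, 4, 12, 0],
})

# Z-score scaling: subtract each column's mean and divide by its standard deviation,
# so that no single feature dominates a distance calculation.
scaled = (passengers - passengers.mean()) / passengers.std(ddof=0)

print(scaled.round(2))
```

After scaling, each column has mean 0 and standard deviation 1, so a distance-based clustering method would weigh both features comparably.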

### About Data (Seeds data)


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('codon_usage.csv')

In [3]:
df.head()

Unnamed: 0,Kingdom,DNAtype,SpeciesID,Ncodons,SpeciesName,UUU,UUC,UUA,UUG,CUU,...,CGG,AGA,AGG,GAU,GAC,GAA,GAG,UAA,UAG,UGA
0,vrl,0,100217,1995,Epizootic haematopoietic necrosis virus,0.01654,0.01203,0.0005,0.00351,0.01203,...,0.00451,0.01303,0.03559,0.01003,0.04612,0.01203,0.04361,0.00251,0.0005,0.0
1,vrl,0,100220,1474,Bohle iridovirus,0.02714,0.01357,0.00068,0.00678,0.00407,...,0.00136,0.01696,0.03596,0.01221,0.04545,0.0156,0.0441,0.00271,0.00068,0.0
2,vrl,0,100755,4862,Sweet potato leaf curl virus,0.01974,0.0218,0.01357,0.01543,0.00782,...,0.00596,0.01974,0.02489,0.03126,0.02036,0.02242,0.02468,0.00391,0.0,0.00144
3,vrl,0,100880,1915,Northern cereal mosaic virus,0.01775,0.02245,0.01619,0.00992,0.01567,...,0.00366,0.0141,0.01671,0.0376,0.01932,0.03029,0.03446,0.00261,0.00157,0.0
4,vrl,0,100887,22831,Soil-borne cereal mosaic virus,0.02816,0.01371,0.00767,0.03679,0.0138,...,0.00604,0.01494,0.01734,0.04148,0.02483,0.03359,0.03679,0.0,0.00044,0.00131


In [4]:
df.shape

(13028, 69)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13028 entries, 0 to 13027
Data columns (total 69 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Kingdom      13028 non-null  object 
 1   DNAtype      13028 non-null  int64  
 2   SpeciesID    13028 non-null  int64  
 3   Ncodons      13028 non-null  int64  
 4   SpeciesName  13028 non-null  object 
 5   UUU          13028 non-null  object 
 6   UUC          13028 non-null  object 
 7   UUA          13028 non-null  float64
 8   UUG          13028 non-null  float64
 9   CUU          13028 non-null  float64
 10  CUC          13028 non-null  float64
 11  CUA          13028 non-null  float64
 12  CUG          13028 non-null  float64
 13  AUU          13028 non-null  float64
 14  AUC          13028 non-null  float64
 15  AUA          13028 non-null  float64
 16  AUG          13028 non-null  float64
 17  GUU          13028 non-null  float64
 18  GUC          13028 non-null  float64
 19  GUA 
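
In the `info()` output above, `UUU` and `UUC` are listed as `object` while the other codon columns are `float64`, which suggests that those two columns contain at least one non-numeric entry. One possible way to handle this, shown on a toy frame rather than the real file, is `pd.to_numeric` with `errors="coerce"`. The stray text value below is an assumption made for illustration.

```python
import pandas as pd

# Toy frame standing in for the real data; the stray text value is an assumption.
toy = pd.DataFrame({
    "UUU": ["0.01654", "0.02714", "not a number"],
    "UUC": ["0.01203", "0.01357", "0.0218"],
})

for col in ["UUU", "UUC"]:
    # errors="coerce" turns anything unparseable into NaN instead of raising
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Count how many entries could not be parsed in each column
n_bad = toy[["UUU", "UUC"]].isna().sum()
print(toy.dtypes)
print(n_bad)
```

Rows that become `NaN` this way can then be inspected and either corrected or dropped, depending on what the offending values turn out to be.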

In [6]:
df.isnull()

Unnamed: 0,Kingdom,DNAtype,SpeciesID,Ncodons,SpeciesName,UUU,UUC,UUA,UUG,CUU,...,CGG,AGA,AGG,GAU,GAC,GAA,GAG,UAA,UAG,UGA
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13023,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
13024,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
13025,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
13026,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
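
A full boolean frame like the one above is hard to read for thousands of rows. A more compact check, sketched here on a toy frame with made-up values, is to count the missing entries per column with `isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value in each column (values are made up)
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

missing_per_column = toy.isnull().sum()        # count of missing values per column
total_missing = int(missing_per_column.sum())  # overall count across the frame
any_missing = bool(toy.isnull().values.any())  # quick yes/no check

print(missing_per_column)
print("total missing:", total_missing, "| any missing:", any_missing)
```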


In [7]:
df.describe()

Unnamed: 0,DNAtype,SpeciesID,Ncodons,UUA,UUG,CUU,CUC,CUA,CUG,AUU,...,CGG,AGA,AGG,GAU,GAC,GAA,GAG,UAA,UAG,UGA
count,13028.0,13028.0,13028.0,13028.0,13028.0,13028.0,13028.0,13028.0,13028.0,13028.0,...,13028.0,13028.0,13028.0,13028.0,13028.0,13028.0,13028.0,13028.0,13028.0,13028.0
mean,0.367209,130451.105926,79605.76,0.020637,0.014104,0.01782,0.018288,0.019044,0.01845,0.028352,...,0.005454,0.009929,0.006422,0.024178,0.021164,0.02829,0.021683,0.001645,0.000592,0.006178
std,0.688726,124787.086107,719701.0,0.020709,0.00928,0.010586,0.014572,0.02425,0.016578,0.017507,...,0.006605,0.008574,0.006387,0.013828,0.013041,0.014342,0.015018,0.001834,0.000907,0.010344
min,0.0,7.0,1000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,28850.75,1602.0,0.00561,0.007108,0.01089,0.00783,0.005307,0.00718,0.01636,...,0.00122,0.00169,0.00117,0.01238,0.01186,0.01736,0.00971,0.00056,0.0,0.00041
50%,0.0,81971.5,2927.5,0.01526,0.01336,0.01613,0.01456,0.009685,0.0128,0.025475,...,0.00353,0.00927,0.004545,0.02542,0.01907,0.026085,0.02054,0.00138,0.00042,0.00113
75%,1.0,222891.25,9120.0,0.029485,0.01981,0.02273,0.025112,0.017245,0.024315,0.038113,...,0.00715,0.015922,0.01025,0.03419,0.02769,0.0368,0.031122,0.00237,0.00083,0.00289
max,12.0,465364.0,40662580.0,0.15133,0.10119,0.08978,0.10035,0.16392,0.10737,0.15406,...,0.05554,0.09883,0.05843,0.18566,0.11384,0.14489,0.15855,0.0452,0.02561,0.1067


## Another dataset

In [8]:
df_1 = pd.read_csv('data.csv')

In [9]:
df_1.head()

Unnamed: 0,Season,Cultivar,Repetition,PH,IFP,NLP,NGP,NGL,NS,MHG,GY
0,1,NEO 760 CE,1,58.8,15.2,98.2,177.8,1.81,5.2,152.2,3232.82
1,1,NEO 760 CE,2,58.6,13.4,102.0,195.0,1.85,7.2,141.69,3517.36
2,1,NEO 760 CE,3,63.4,17.2,100.4,203.0,2.02,6.8,148.81,3391.46
3,1,NEO 760 CE,4,60.27,15.27,100.2,191.93,1.89,6.4,148.5,3312.58
4,1,MANU IPRO,1,81.2,18.0,98.8,173.0,1.75,7.4,145.59,3230.99


In [10]:
df_1.isnull().sum()

Season        0
Cultivar      0
Repetition    0
PH            0
IFP           0
NLP           0
NGP           0
NGL           0
NS            0
MHG           0
GY            0
dtype: int64

In [11]:
df_1.describe()

Unnamed: 0,Season,Repetition,PH,IFP,NLP,NGP,NGL,NS,MHG,GY
count,320.0,320.0,320.0,320.0,320.0,320.0,320.0,320.0,320.0,320.0
mean,1.5,2.5,68.386781,15.465,59.088313,135.085844,2.290844,4.071656,168.322313,3418.553794
std,0.500783,1.119785,8.958194,3.0243,20.068187,60.494529,0.840116,1.474531,19.625566,503.003602
min,1.0,1.0,47.6,7.2,20.2,47.8,0.94,0.4,127.06,1538.23
25%,1.0,1.75,62.95,13.6,44.35,95.0525,2.0,3.0,153.845,3126.611552
50%,1.5,2.5,67.2,15.6,54.5,123.0,2.28,3.8,166.15,3397.276724
75%,2.0,3.25,74.3475,17.33,71.22,161.35,2.48,5.0,183.1825,3708.262931
max,2.0,4.0,94.8,26.4,123.0,683.4,14.86,9.0,216.0,4930.0
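
In the summary above, the maxima of `NGP` (683.4) and `NGL` (14.86) sit far above their 75th percentiles (161.35 and 2.48), which may point to outliers. One common rule of thumb, sketched below on a handful of toy values rather than the real column, flags points lying more than 1.5 times the interquartile range beyond the quartiles. The threshold of 1.5 is a convention, not something prescribed by this dataset.

```python
import pandas as pd

# Toy values loosely resembling the NGP column; not the real data
ngp = pd.Series([95.0, 110.0, 123.0, 150.0, 161.0, 683.4], name="NGP")

q1, q3 = ngp.quantile(0.25), ngp.quantile(0.75)
iqr = q3 - q1                      # interquartile range
lower = q1 - 1.5 * iqr             # lower fence
upper = q3 + 1.5 * iqr             # upper fence

# Keep only the values falling outside the fences
outliers = ngp[(ngp < lower) | (ngp > upper)]
print(outliers)
```

Flagged values are candidates for a closer look, not automatic deletions; a data entry error and a genuinely extreme measurement call for different treatment.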
