## East Airline Analysis

## 1 Introduction
### 1.1 What is data
Data is a compilation of plain facts gathered within a defined context. Statisticians describe it as a set of observations stored in variables (columns), each with its own characteristics. Once summarised, data becomes information; once analysed, it can serve as evidence for or against a hypothesis. Before data can become information and ultimately evidence, it must go through a process. Data analysis is the process in which raw data is ordered and organised so that it can be used in methods that help to explain the past and predict the future (Hector, 2013).

Data analytics has no predefined methodology, because every dataset and context is different. In general, though, turning raw data into actionable insight can be expected to include exploratory data analysis (EDA), descriptive analysis, statistical analysis, diagnostic analysis, predictive analysis and prescriptive analysis. Regardless of the domain, most analytical projects share one ultimate goal: to derive actionable insights from a dataset and communicate them so that decision makers, such as stakeholders, can make informed decisions about the context in which the data was gathered.

Most of the time, these decision makers are not data experts. As data professionals, we should be well informed about, and interested in, the subtle details of analytical methodology, but there is a clear gap between the information we seek to communicate and the ability of those who need it to understand it, usually because they lack the technical background. This is where data visualisation comes into play. Trends and insights presented in pictorial form are generally easier to absorb than raw numbers or text. Ultimately, a good data analytics implementation should give a clear picture of where an organisation is, where it has been, and where it is heading (UCDPA, 2024).

### 1.2 Machine learning 

Machine learning (ML) is a branch of artificial intelligence (AI) that focuses on building computer systems that learn from data (Linda, 2021). ML algorithms are trained to find relationships and patterns in data (Linda, 2021). Although different ML algorithms work in different ways, their ultimate goal is to make predictions about the future of a particular data context by learning the patterns in the data currently available. The quality and reliability of those predictions depend on the quality of the data used to train the algorithms.

### Branches of Machine learning 

1. Supervised machine learning
2. Unsupervised machine learning 
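
The difference between the two branches can be illustrated with a minimal sketch. It assumes scikit-learn is installed and uses made-up toy points rather than either dataset in this report; `LogisticRegression` and `KMeans` are simply convenient stand-ins for a supervised and an unsupervised algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): two well-separated groups of 2-D points.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(20, 2))
X = np.vstack([group_a, group_b])
y = np.array([0] * 20 + [1] * 20)  # known labels, used only by the supervised model

# Supervised: the model is given both the features X and the known labels y.
clf = LogisticRegression().fit(X, y)
predicted_labels = clf.predict([[0.1, -0.2], [4.8, 5.1]])

# Unsupervised: the model sees only X and has to find the groups by itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_
```

The supervised model returns the labels it was taught, while the cluster ids from k-means are arbitrary integers: they say which points belong together, not what each group means.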

### About data
The file EastWestAirlines contains information on passengers who belong to an airline’s frequent flier program.
For each passenger the data include information on their mileage history and on different ways they accrued or
spent miles in the last year. The goal is to try to identify clusters of passengers that have similar characteristics
for the purpose of targeting different segments for different types of mileage offers.

### About data (soybean cultivars data) (ref: UCI repo)
Soybean is one of the most important crops because it is used in several segments of the food industry. The evaluation of soybean cultivars under different planting and harvesting conditions is an ongoing field of research. The dataset was obtained from forty soybean cultivars planted in subsequent seasons. The experiment used randomized blocks, arranged in a split-plot scheme, with four replications. The following variables were collected: plant height, insertion of the first pod, number of stems, number of legumes per plant, number of grains per pod, thousand-seed weight, and grain yield, resulting in 320 data samples. The dataset can be used by researchers from different fields of activity.
### Task
Use an appropriate clustering method to identify clusters among the soybean cultivars harvested in subsequent seasons.
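
One possible way to approach this task is sketched below, assuming scikit-learn is available. It is not necessarily the method used in this analysis: k-means is only one candidate, the helper `cluster_numeric_features` is a hypothetical name, and the small `demo` table is synthetic data that borrows two column names (`PH`, `GY`) from the dataset for illustration. Features are standardised first because k-means relies on distances, so columns on large scales would otherwise dominate.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster_numeric_features(df, feature_cols, n_clusters, random_state=0):
    """Standardise the chosen columns and return one k-means label per row."""
    scaled = StandardScaler().fit_transform(df[feature_cols])
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    return pd.Series(model.fit_predict(scaled), index=df.index, name="cluster")


# Synthetic stand-in for the real table: two artificial groups of 15 rows each.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "PH": np.concatenate([rng.normal(60, 1, 15), rng.normal(85, 1, 15)]),
    "GY": np.concatenate([rng.normal(3000, 50, 15), rng.normal(4200, 50, 15)]),
})
demo["cluster"] = cluster_numeric_features(demo, ["PH", "GY"], n_clusters=2)
```

On the real data the number of clusters is not known in advance, so `n_clusters` would need to be chosen with a diagnostic such as an elbow plot or silhouette scores.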

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
warnings.filterwarnings('ignore')

## Second dataset: soybean cultivars

In [8]:
df_1 = pd.read_csv('data.csv')

In [9]:
df_1.head()

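| | Season | Cultivar | Repetition | PH | IFP | NLP | NGP | NGL | NS | MHG | GY |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | NEO 760 CE | 1 | 58.8 | 15.2 | 98.2 | 177.8 | 1.81 | 5.2 | 152.2 | 3232.82 |
| 1 | 1 | NEO 760 CE | 2 | 58.6 | 13.4 | 102.0 | 195.0 | 1.85 | 7.2 | 141.69 | 3517.36 |
| 2 | 1 | NEO 760 CE | 3 | 63.4 | 17.2 | 100.4 | 203.0 | 2.02 | 6.8 | 148.81 | 3391.46 |
| 3 | 1 | NEO 760 CE | 4 | 60.27 | 15.27 | 100.2 | 191.93 | 1.89 | 6.4 | 148.5 | 3312.58 |
| 4 | 1 | MANU IPRO | 1 | 81.2 | 18.0 | 98.8 | 173.0 | 1.75 | 7.4 | 145.59 | 3230.99 |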


## Exploratory data analysis

In [10]:
df_1.isnull().sum()

Season        0
Cultivar      0
Repetition    0
PH            0
IFP           0
NLP           0
NGP           0
NGL           0
NS            0
MHG           0
GY            0
dtype: int64

In [11]:
df_1.describe()

| | Season | Repetition | PH | IFP | NLP | NGP | NGL | NS | MHG | GY |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 320.0 | 320.0 | 320.0 | 320.0 | 320.0 | 320.0 | 320.0 | 320.0 | 320.0 | 320.0 |
| mean | 1.5 | 2.5 | 68.386781 | 15.465 | 59.088313 | 135.085844 | 2.290844 | 4.071656 | 168.322313 | 3418.553794 |
| std | 0.500783 | 1.119785 | 8.958194 | 3.0243 | 20.068187 | 60.494529 | 0.840116 | 1.474531 | 19.625566 | 503.003602 |
| min | 1.0 | 1.0 | 47.6 | 7.2 | 20.2 | 47.8 | 0.94 | 0.4 | 127.06 | 1538.23 |
| 25% | 1.0 | 1.75 | 62.95 | 13.6 | 44.35 | 95.0525 | 2.0 | 3.0 | 153.845 | 3126.611552 |
| 50% | 1.5 | 2.5 | 67.2 | 15.6 | 54.5 | 123.0 | 2.28 | 3.8 | 166.15 | 3397.276724 |
| 75% | 2.0 | 3.25 | 74.3475 | 17.33 | 71.22 | 161.35 | 2.48 | 5.0 | 183.1825 | 3708.262931 |
| max | 2.0 | 4.0 | 94.8 | 26.4 | 123.0 | 683.4 | 14.86 | 9.0 | 216.0 | 4930.0 |
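
In the summary above, the maxima of `NGP` (683.4) and `NGL` (14.86) sit far from their upper quartiles, which suggests possible outliers. The sketch below checks this with Tukey's 1.5 × IQR rule, using only the quartiles and maxima copied from the `describe()` output. The helper `iqr_upper_fence` is a hypothetical name, and the 1.5 multiplier is a common convention rather than a choice made in this analysis.

```python
def iqr_upper_fence(q1, q3, k=1.5):
    """Tukey's upper fence: values above q3 + k * IQR are flagged as potential outliers."""
    return q3 + k * (q3 - q1)


# Quartiles and maxima copied from the describe() output above.
summary = {
    "NGP": {"q1": 95.0525, "q3": 161.35, "max": 683.4},
    "NGL": {"q1": 2.0, "q3": 2.48, "max": 14.86},
}

flagged = {}
for col, s in summary.items():
    fence = iqr_upper_fence(s["q1"], s["q3"])
    flagged[col] = s["max"] > fence  # True if the maximum lies beyond the fence
    print(f"{col}: upper fence = {fence:.2f}, max = {s['max']}, beyond fence: {flagged[col]}")
```

Both maxima lie beyond their fences. Because distance-based clustering methods are sensitive to extreme values, it may be worth inspecting these rows before clustering; whether they are data-entry errors or genuine measurements cannot be decided from the summary alone.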
