# Delete Later

## Assessment Task

Students are advised to review and adhere to the submission requirements documented after the
assessment task.

Scenario:

“Today, big data is ubiquitous, machine learning applications are thriving, artificial intelligence appears in
everyday conversations, and the internet of things is present even in household appliances. Businesses and
organizations are increasingly managed through cloud computing and high performance computing is
progressively accessible as a service…More effective operations, reduced uncertainties, and real time
decision-support could revolutionize agriculture to a great extent. Food could be produced more efficiently,
of higher nutritional quality, in more stable supplies, with less environmental damage, and likely with
additional economic, social, and ecological benefits.” (Sjoukje A. Osinga, Dilli Paudel, Spiros A. Mouzakitis,
Ioannis N. Athanasiadis (2022))

You have been tasked with analysing Ireland's Agricultural data and comparing the Irish Agri sector with
other countries worldwide. This analysis should also include forecasting, sentiment analysis and evidence
based recommendations for the sector as well as a complete rationale of the entire process used to
discover your findings. Your research could include export, import, trade imbalance, arable production,
animal stock, medicinal input, organic, GM products etc. (or any other relevant topic EXCEPT climate
change) with Ireland as your baseline.

Note:
- While topical, Agricultural impact on Climate Change SHOULD NOT be chosen as an area of research for this assessment.
- Members of the European Union implement the Common Agricultural Policy and this should be researched as it has a significant statistical impact.
- The United Kingdom is NOT part of the European Union

You must source appropriate data sets from any available repository to inform your research (all datasets MUST be referenced and the relevant licence/permissions detailed).

### Criteria of Analysis

Discuss the choice of project management framework you have deemed suitable for this project.
It is expected that you use some type of version control software, e.g. GitHub, GitLab, Bitbucket, etc., with regular commits of code and report versions. Please include the address of your version control repository in your report.

### Programming for DA Tasks [0-100]
- The project must be explored programmatically; this means that you must implement suitable Python tools (code and/or libraries) to complete the analysis required. All of this is to be implemented in a Jupyter Notebook. [0-50]
- The project documentation must include sound justifications and explanations of your code choices (code quality standards should also be applied). [0-50]

Total Mark = 50+50=100:(100\*0.5=50%)

### Statistics for Data Analytics Tasks
- Use descriptive statistics and appropriate visualisations in order to summarise the dataset(s) used, and to help justify the chosen models. [0-20]
- Analyse the variables in your dataset(s) and use appropriate inferential statistics to gain insights on possible population values (e.g., if you were working with international commerce, you could find a confidence interval for the population proportion of yearly dairy exports out of all agricultural exports). [0-20]
- Undertake research to find similarities between some country(s) against Ireland, and apply parametric and non-parametric inferential statistical techniques to compare them (e.g., t-test, analysis of variance, Wilcoxon test, chi-squared test, among others). You must justify your choices and verify the applicability of the tests. Hypotheses and conclusions must be clearly stated. You are expected to use at least 5 different inferential statistics tests. [0-40]
- Use the outcome of your analysis to deepen your research. Indicate the challenges you faced in the process. [0-20]

Note: All your calculations and reasoning behind your models must be documented in the report and/or the appendix.
Total Mark = 20+20+40+20=100:(100\*0.5=50%)

### Machine Learning Tasks
Use of multiple models (at least two) to compare and contrast results and insights gained.
- Describe the rationale and justification for the choice of machine learning models for the above-mentioned scenario. Machine Learning models can be used for Prediction, Classification, Clustering, sentiment analysis, recommendation systems and Time series analysis. You should plan on trying multiple approaches (at least two) with proper selection of hyperparameters using GridSearchCV method. You can choose appropriate features from the datasets and a target feature to answer the question asked in the scenario in the case of supervised learning. [0 - 30]
- Collect and develop a dataset based on the agriculture topic related to Ireland as well as other parts of the world. Perform a sentiment analysis for an appropriate agricultural topic (e.g., product price, feed quality etc.) from the producers' and consumers' points of view in Ireland. [0 - 25]
- You should train and test for Supervised Learning and other appropriate metrics for unsupervised/semi-supervised machine learning models that you have chosen. Use cross validation to provide authenticity of the modelling outcomes. You can apply dimensionality reduction methods to prepare the dataset based on your machine learning modelling requirements. [0 - 30]
- A Table or graphics should be provided to illustrate the similarities and contrast of the Machine Learning modelling outcomes based on the scoring metric used for the analysis of the above-mentioned scenario. Discuss and elaborate your understanding clearly. [0 - 15]

Total Mark = 30+25+30+15=100:(100\*0.5=50%)

### Data Preparation & Visualisation Tasks
- Discuss in detail the process of acquiring your raw data, detailing the positive and/or negative aspects of your research and acquisition. This should include the relevance and implications of any and all licensing/permissions associated with the data. [0-15]
- Exploratory Data Analysis helps to identify patterns, inconsistencies, anomalies, missing data, and other attributes and issues in data sets so problems can be addressed. Evaluate your raw data and detail, in depth, the various attributes and issues that you find. Your evaluation should reference evidence to support your chosen methodology and use visualizations to illustrate your findings. [0-25]
- Taking into consideration the tasks required in the machine learning section, use appropriate data cleaning, engineering, extraction and/or other techniques to structure and enrich your data. Rationalize your decisions and implementation, including evidence of how your process has addressed the problems identified in the EDA (Exploratory Data Analysis) stage and how your structured data will assist in the analysis stage. This should include visualizations to illustrate your work and evidence to support your methodology. [0-30]
- Modern farming has a great dependence on technology and relies upon visualizations to communicate information; this includes web-based, mobile-based and many other digital transmission formats. Develop an interactive dashboard tailored to modern farmers, using Tufte's principles, to showcase the information/evidence gathered following your Machine Learning Analysis. Detail the rationale for the approach and visualisation choices made during development. Note you may not use Power BI, RapidMiner, Tableau or other such tools to accomplish this (at this stage). [0-30]

Total Mark = 15+25+30+30=100:(100\*0.5=50%)

### Additional notes
All:
- Your documentation should present your approach to the project, including elements of project planning (timelines).
- Ensure that your documentation follows a logical sequence through the planning / research / justification / implementation phases of the project.
- Ensure that your final upload contains a maximum of 1 Jupyter notebook per module.
- Please ensure that additional resources are placed in, and linked to, a logical file structure, e.g. Scripts, Images, Report, Data etc.
- Ensure that you include your raw and structured datasets in your submission.
- 3000 (+/- 10%) words in the report (not including code, code comments, titles, references or citations).
- Your word count MUST be included.

(it is expected that research be carried out beyond class material)

### Submission Requirements

All assessment submissions must meet the minimum requirements listed below.
Failure to do so may have implications for the mark awarded.

All assessment submissions must:
- Include a Jupyter Notebook, Word document, dashboard and version control address
- Be submitted by the deadline date specified or be subject to late submission penalties
- Be submitted via Moodle upload
- Use Harvard Referencing when citing third party material
- Be the student’s own work.
- Include the CCT assessment cover page.

In [41]:
import pandas as pd

In [42]:
df = pd.read_csv('Data/FAOSTAT_data_en_12-19-2022_IRL_SPA_crop_livestock_products.csv')

In [43]:
df.head()

Unnamed: 0,Domain Code,Domain,Area Code (ISO3),Area,Element Code,Element,Item Code (CPC),Item,Year Code,Year,Unit,Value,Flag,Flag Description
0,TCL,Crops and livestock products,IRL,Ireland,5610,Import Quantity,1929.07,"Abaca, manila hemp, raw",1961,1961,tonnes,,O,Missing value
1,TCL,Crops and livestock products,IRL,Ireland,5610,Import Quantity,1929.07,"Abaca, manila hemp, raw",1962,1962,tonnes,,O,Missing value
2,TCL,Crops and livestock products,IRL,Ireland,5610,Import Quantity,1929.07,"Abaca, manila hemp, raw",1963,1963,tonnes,,O,Missing value
3,TCL,Crops and livestock products,IRL,Ireland,5610,Import Quantity,1929.07,"Abaca, manila hemp, raw",1964,1964,tonnes,,O,Missing value
4,TCL,Crops and livestock products,IRL,Ireland,5610,Import Quantity,1929.07,"Abaca, manila hemp, raw",1965,1965,tonnes,,O,Missing value


In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 172266 entries, 0 to 172265
Data columns (total 14 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Domain Code       172266 non-null  object 
 1   Domain            172266 non-null  object 
 2   Area Code (ISO3)  172266 non-null  object 
 3   Area              172266 non-null  object 
 4   Element Code      172266 non-null  int64  
 5   Element           172266 non-null  object 
 6   Item Code (CPC)   172266 non-null  object 
 7   Item              172266 non-null  object 
 8   Year Code         172266 non-null  int64  
 9   Year              172266 non-null  int64  
 10  Unit              172266 non-null  object 
 11  Value             164530 non-null  float64
 12  Flag              172266 non-null  object 
 13  Flag Description  172266 non-null  object 
dtypes: float64(1), int64(3), object(10)
memory usage: 18.4+ MB


In [45]:
df = df.dropna(subset=['Value'])

In [46]:
df['Item'].unique()

array(['Abaca, manila hemp, raw', 'Almonds, in shell', 'Almonds, shelled',
       'Animal oils and fats n.e.c.',
       'Animal or vegetable fats and oils and their fractions, chemically modified, except those hydrogenated, inter-esterified, re-esterified or elaidinized; inedible mixtures or preparations of animal or vegetable fats or oils',
       'Anise, badian, coriander, cumin, caraway, fennel and juniper berries, raw',
       'Apple juice', 'Apple juice, concentrated', 'Apples', 'Apricots',
       'Apricots, dried', 'Areca nuts', 'Artichokes', 'Asparagus',
       'Asses', 'Avocados', 'Bananas', 'Barley', 'Barley, pearled',
       'Beans, dry', 'beef and veal preparations nes',
       'Beer of barley, malted', 'Bees', 'Beeswax', 'Beet pulp',
       'Blueberries', 'Bran of maize', 'Bran of wheat',
       'Brazil nuts, shelled', 'Bread', 'Breakfast cereals',
       'Brewing or distilling dregs and waste',
       'Broad beans and horse beans, dry', 'Buckwheat', 'Buffalo',
       'Bulg

In [47]:
df[df['Item']=='Turkeys']

Unnamed: 0,Domain Code,Domain,Area Code (ISO3),Area,Element Code,Element,Item Code (CPC),Item,Year Code,Year,Unit,Value,Flag,Flag Description
77436,TCL,Crops and livestock products,IRL,Ireland,5609,Import Quantity,02152,Turkeys,1961,1961,1000 Head,0.0,A,Official figure
77437,TCL,Crops and livestock products,IRL,Ireland,5609,Import Quantity,02152,Turkeys,1962,1962,1000 Head,0.0,A,Official figure
77438,TCL,Crops and livestock products,IRL,Ireland,5609,Import Quantity,02152,Turkeys,1963,1963,1000 Head,0.0,A,Official figure
77439,TCL,Crops and livestock products,IRL,Ireland,5609,Import Quantity,02152,Turkeys,1964,1964,1000 Head,0.0,A,Official figure
77440,TCL,Crops and livestock products,IRL,Ireland,5609,Import Quantity,02152,Turkeys,1965,1965,1000 Head,0.0,A,Official figure
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
165329,TCL,Crops and livestock products,ESP,Spain,5922,Export Value,02152,Turkeys,2016,2016,1000 US$,143.0,A,Official figure
165330,TCL,Crops and livestock products,ESP,Spain,5922,Export Value,02152,Turkeys,2017,2017,1000 US$,172.0,A,Official figure
165331,TCL,Crops and livestock products,ESP,Spain,5922,Export Value,02152,Turkeys,2018,2018,1000 US$,118.0,A,Official figure
165332,TCL,Crops and livestock products,ESP,Spain,5922,Export Value,02152,Turkeys,2019,2019,1000 US$,129.0,A,Official figure


In [48]:
df['Element'].unique()

array(['Import Quantity', 'Import Value', 'Export Quantity',
       'Export Value'], dtype=object)

## Getting new data

Let's look at poultry (meat) farming in Ireland and Spain, starting with which factors affect chicken prices in Ireland. I'm mainly going to look at feed and the price of grains. According to an EU factsheet on Irish agriculture, feeding stuff is by far the largest expense in agriculture, around five times the cost of the next largest expense, energy. (This is for all agriculture in Ireland though, so cattle probably dominate these figures.)

Chicken feed is predominantly maize, soya and wheat. Processed pig protein may also be used, but feeding it to chickens only became legal in the EU in August 2021, so I'm not going to look into it. Maize and soya are almost entirely imported: feed suppliers buy the grains from the USA, Ukraine, Russia, etc., and Irish poultry farmers then buy their chicken feed from Irish feed suppliers. Statistics on feed supplies are hard to find (especially ones that relate only to poultry), so I'm going to look at grain prices instead.

So, do grain prices have an impact on the consumer price of chicken? And what other factors affect the consumer price of chicken meat? Farm size? Intensive vs free range vs organic? Market demand? The price of cheap chicken imported from Brazil, Ukraine and Thailand? According to teagasc.ie, those imports are driven by the Irish fast-food sector's high demand for cheap chicken, which Irish producers can't meet because of high raw feed prices. Cheap imported chicken is also reportedly undercutting European farmers (the Guardian, in an article about feeding animals to animals); let's see how true that is.

What's the chicken consumption per capita in Ireland vs Spain? What's the price of chicken relative to our GDP? Are we being screwed on our chicken price, or is it in line with our higher cost of living and wages in general?
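One way to start on the grain price question is a simple year-by-year correlation between two items. The sketch below is only an illustration: `item_correlation` is a hypothetical helper, it assumes the price files share the long `Area` / `Item` / `Year` / `Value` layout seen in the FAOSTAT files so far, and the toy numbers are made-up placeholders, not real FAOSTAT values.

```python
import pandas as pd


def item_correlation(df, area, item_a, item_b):
    """Pearson correlation between the yearly values of two items in one area.

    Assumes a long-format frame with 'Area', 'Item', 'Year' and 'Value' columns.
    """
    subset = df[(df["Area"] == area) & (df["Item"].isin([item_a, item_b]))]
    # One row per year, one column per item.
    wide = subset.pivot_table(index="Year", columns="Item", values="Value")
    # Keep only the years where both items have a value.
    wide = wide.dropna()
    return wide[item_a].corr(wide[item_b])


# Made-up placeholder numbers, NOT real FAOSTAT values.
toy = pd.DataFrame({
    "Area": ["Ireland"] * 8,
    "Item": ["Wheat"] * 4 + ["Meat of chickens, fresh or chilled"] * 4,
    "Year": [2000, 2001, 2002, 2003] * 2,
    "Value": [100.0, 110.0, 120.0, 130.0, 200.0, 220.0, 240.0, 260.0],
})

r = item_correlation(toy, "Ireland", "Wheat", "Meat of chickens, fresh or chilled")
print(round(r, 3))
```

A high correlation here would only show that the two series move together over time; it would not by itself show that grain prices drive chicken prices, since both could simply follow general inflation.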

In [11]:
from os import listdir
from os.path import isfile, join

# List every file (not sub-directory) in the datasets folder.
DATA_DIR = './datasets/'
FAOfiles = [f for f in listdir(DATA_DIR) if isfile(join(DATA_DIR, f))]

In [12]:
FAOfiles

['FAOSTAT_data_en_IRL_SPA_consumer_price_indices.csv',
 'FAOSTAT_data_en_IRL_SPA_crops_and_livestock_products_QCL.csv',
 'FAOSTAT_data_en_IRL_SPA_crops_and_livestock_products_TCL.csv',
 'FAOSTAT_data_en_IRL_SPA_production_indices.csv',
 'FAOSTAT_data_en_IRL_SPA_trade_indices.csv',
 'FAOSTAT_data_en_IRL_SPA_value_agri_production.csv',
 'FAOSTAT_data_en_structural_data_from_agri_census.csv',
 'FAOSTAT_data_en__IRL_SPA_producer_prices.csv']

In [13]:
# Load each CSV into a dict keyed by the file name without its '.csv' extension.
df_dict = {}
for file in FAOfiles:
    df_dict[file[:-4]] = pd.read_csv('./datasets/'+file)

In [14]:
df_dict

{'FAOSTAT_data_en_IRL_SPA_consumer_price_indices':      Domain Code                  Domain  Area Code (M49)     Area  Year Code  \
 0             CP  Consumer Price Indices              372  Ireland       2000   
 1             CP  Consumer Price Indices              372  Ireland       2000   
 2             CP  Consumer Price Indices              372  Ireland       2000   
 3             CP  Consumer Price Indices              372  Ireland       2000   
 4             CP  Consumer Price Indices              372  Ireland       2000   
 ...          ...                     ...              ...      ...        ...   
 1609          CP  Consumer Price Indices              724    Spain       2022   
 1610          CP  Consumer Price Indices              724    Spain       2022   
 1611          CP  Consumer Price Indices              724    Spain       2022   
 1612          CP  Consumer Price Indices              724    Spain       2022   
 1613          CP  Consumer Price Indices       

In [16]:
print(df_dict.keys())

dict_keys(['FAOSTAT_data_en_IRL_SPA_consumer_price_indices', 'FAOSTAT_data_en_IRL_SPA_crops_and_livestock_products_QCL', 'FAOSTAT_data_en_IRL_SPA_crops_and_livestock_products_TCL', 'FAOSTAT_data_en_IRL_SPA_production_indices', 'FAOSTAT_data_en_IRL_SPA_trade_indices', 'FAOSTAT_data_en_IRL_SPA_value_agri_production', 'FAOSTAT_data_en_structural_data_from_agri_census', 'FAOSTAT_data_en__IRL_SPA_producer_prices'])


In [17]:
df_dict['FAOSTAT_data_en_IRL_SPA_crops_and_livestock_products_QCL']

Unnamed: 0,Domain Code,Domain,Area Code (M49),Area,Element Code,Element,Item Code (CPC),Item,Year Code,Year,Unit,Value,Flag,Flag Description
0,QCL,Crops and livestock products,372,Ireland,5312,Area harvested,115.0,Barley,1961,1961,ha,146373.0,A,Official figure
1,QCL,Crops and livestock products,372,Ireland,5312,Area harvested,115.0,Barley,1962,1962,ha,164219.0,A,Official figure
2,QCL,Crops and livestock products,372,Ireland,5312,Area harvested,115.0,Barley,1963,1963,ha,173608.0,A,Official figure
3,QCL,Crops and livestock products,372,Ireland,5312,Area harvested,115.0,Barley,1964,1964,ha,183644.0,A,Official figure
4,QCL,Crops and livestock products,372,Ireland,5312,Area harvested,115.0,Barley,1965,1965,ha,187790.0,A,Official figure
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2653,QCL,Crops and livestock products,724,Spain,5510,Production,111.0,Wheat,2017,2017,tonnes,4875957.0,A,Official figure
2654,QCL,Crops and livestock products,724,Spain,5510,Production,111.0,Wheat,2018,2018,tonnes,8322510.0,A,Official figure
2655,QCL,Crops and livestock products,724,Spain,5510,Production,111.0,Wheat,2019,2019,tonnes,6041170.0,A,Official figure
2656,QCL,Crops and livestock products,724,Spain,5510,Production,111.0,Wheat,2020,2020,tonnes,8143510.0,A,Official figure


In [18]:
df_dict['FAOSTAT_data_en_IRL_SPA_crops_and_livestock_products_QCL']['Element'].unique()

array(['Area harvested', 'Yield', 'Production', 'Stocks',
       'Yield/Carcass Weight', 'Producing Animals/Slaughtered'],
      dtype=object)

In [19]:
df_dict['FAOSTAT_data_en_IRL_SPA_crops_and_livestock_products_QCL']['Unit'].unique()

array(['ha', 'hg/ha', 'tonnes', '1000 Head', '0.1g/An'], dtype=object)

Unit translation \[I have a natural inclination for SI units\]:

- ha = hectare (area) = 10000 m$^2$
- hg/ha = hectogram/hectare (yield of cereal) = 0.1 kg / 10000 m$^2$
- tonnes = 1000 kg
- 1000 Head = thousands of animals (a value of 1 means 1,000 animals)
- 0.1g/An = 0.1 g per animal
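The unit translations above can be written as small conversion helpers. This is a minimal sketch: the function names are my own (hypothetical), and it assumes that "1000 Head" counts animals in thousands and that "0.1g/An" means tenths of a gram per animal.

```python
def hg_per_ha_to_tonnes_per_ha(value):
    """Convert a yield from hectograms per hectare to tonnes per hectare."""
    kg_per_ha = value / 10    # 1 hg = 0.1 kg
    return kg_per_ha / 1000   # 1 tonne = 1000 kg


def thousand_head_to_head(value):
    """Convert a count given in '1000 Head' to individual animals."""
    return value * 1000


def tenth_gram_per_animal_to_kg(value):
    """Convert a per-animal weight from 0.1 g units to kilograms."""
    grams = value / 10        # units of 0.1 g -> g
    return grams / 1000       # g -> kg


# Example: a cereal yield of 50,000 hg/ha is 5 tonnes per hectare.
print(hg_per_ha_to_tonnes_per_ha(50000))
```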

In [23]:
qcl = df_dict['FAOSTAT_data_en_IRL_SPA_crops_and_livestock_products_QCL']
qcl[qcl['Unit'] == '1000 Head']['Element'].unique()

array(['Stocks', 'Producing Animals/Slaughtered'], dtype=object)

In [24]:
df_dict['FAOSTAT_data_en_IRL_SPA_crops_and_livestock_products_QCL']['Item'].unique()

array(['Barley', 'Cereals n.e.c.', 'Chickens', 'Maize (corn)',
       'Meat of chickens, fresh or chilled', 'Mixed grain', 'Oats',
       'Soya beans', 'Sunflower seed', 'Wheat', 'Millet'], dtype=object)

In [30]:
items = df_dict['FAOSTAT_data_en_IRL_SPA_crops_and_livestock_products_QCL']['Item'].unique()

In [31]:
items

array(['Barley', 'Cereals n.e.c.', 'Chickens', 'Maize (corn)',
       'Meat of chickens, fresh or chilled', 'Mixed grain', 'Oats',
       'Soya beans', 'Sunflower seed', 'Wheat', 'Millet'], dtype=object)

In [32]:
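# Split the items into chicken-related items and grain/feed-crop items.
chicken_item = []
grains_item = []
for item in items:
    if 'chicken' in item.lower():  # matches 'Chickens' and 'Meat of chickens, ...'
        chicken_item.append(item)
    else:
        grains_item.append(item)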

In [34]:
print('Chicken list: ', chicken_item)
print('Grains list: ', grains_item)

Chicken list:  ['Chickens', 'Meat of chickens, fresh or chilled']
Grains list:  ['Barley', 'Cereals n.e.c.', 'Maize (corn)', 'Mixed grain', 'Oats', 'Soya beans', 'Sunflower seed', 'Wheat', 'Millet']


In [49]:
df_dict['FAOSTAT_data_en_IRL_SPA_crops_and_livestock_products_QCL'].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2658 entries, 0 to 2657
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Domain Code       2658 non-null   object 
 1   Domain            2658 non-null   object 
 2   Area Code (M49)   2658 non-null   int64  
 3   Area              2658 non-null   object 
 4   Element Code      2658 non-null   int64  
 5   Element           2658 non-null   object 
 6   Item Code (CPC)   2658 non-null   float64
 7   Item              2658 non-null   object 
 8   Year Code         2658 non-null   int64  
 9   Year              2658 non-null   int64  
 10  Unit              2658 non-null   object 
 11  Value             2658 non-null   float64
 12  Flag              2658 non-null   object 
 13  Flag Description  2658 non-null   object 
dtypes: float64(2), int64(4), object(8)
memory usage: 290.8+ KB
