

<h1 align='center'> COMP2420/COMP6420 - Introduction to Data Management,<br/> Analysis and Security</h1>

<h1 align='center'> Assignment - 1</h1>

-----
<br/>

## Grading

|**Maximum Marks**         |**100**
|--------------------------|--------
|  **Weight**              |  **15% of the Total Course Grade**
|  **Submission deadline** |  **11:59PM, Friday, March 31**
|  **Submission mode**     |  **Electronic, Using GitLab**
|  **Penalty**             |  **100% after the deadline**


## Learning Outcomes
The following learning outcomes apply to this piece:
- **LO3** - Demonstrate basic knowledge and understanding of descriptive and predictive data analysis methods, optimization and search, and knowledge representation.
- **LO4** - Formulate and extract descriptive and predictive statistics from data
- **LO5** - Analyse and interpret results from descriptive and predictive data analysis
- **LO6** - Apply their knowledge to a given problem domain and articulate potential data analysis problems


## Submission

You need to submit the following items:
- The notebook `Assignment_1_2023_uXXXXXXX.ipynb` (where uXXXXXXX is your uid) 
- A completed `statement-of-originality.md`, found in the root of the forked GitLab repo.

Submissions are performed by pushing to your forked GitLab assignment repository. For a refresher on forking and cloning repositories, please refer to `Lab 1`. Issues with your Git repo (with the exception of a CECS/ANU-wide GitLab failure) will not be considered as grounds for an extension. You will also need to add your details below. Any variation of this will result in a `zero mark`.

***** 

### Notes:

* It is strongly advised to read the whole assignment before attempting it and have at least a cursory glance at the dataset in order to gauge the requirements and understand what you need to do as a bigger picture.
* Back up your assignment to your GitLab repo often. 
* Extra reading and research will be required. Make sure you include all references in your Statement of Originality. If this does not occur, at best marks will be deducted. Otherwise, academic misconduct processes will be followed.
* For answers requiring free form written text, use the designated cells denoted by `YOUR WRITTEN ANSWER HERE` -- double click on the cell to write inside them. Leave it blank if you don't have anything to write.
* For all coding questions please write your code after the comment `YOUR CODE HERE`. Remember to document your code using comments and doc strings as appropriate.
* In the process of testing your code, you can insert more cells or use print statements for debugging, but when submitting your file remember to remove these cells and calls respectively. You are welcome to add additional cells to the final submission, provided they add value to the overall piece.
* You will be marked on the **correctness** and **readability** of your code. If your marker can't understand your code, marks may be deducted.
* Comment your code.
* Before submitting, restart the kernel in Jupyter Lab and re-run all cells. This ensures the namespace has not kept any old variables, as these won't come across in submission and your code will not run. Without this, you could lose a significant number of marks.

*****


### Enter your Student ID below:

## Introduction

The study of sea ice in polar regions is of great importance due to the critical role that sea ice plays in global climate and ecosystem health. Sea ice covers vast expanses of the Arctic and Antarctic oceans, acting as a reflective surface that helps regulate the Earth's temperature by reflecting sunlight back into space. As such, changes in sea ice cover (herein, we refer to it as sea ice extent) can have significant impacts on global climate patterns and ocean currents. Sea ice also provides a critical habitat for a wide range of Arctic and Antarctic marine species, including krill, polar bears, and various species of seals. The health and abundance of these species are directly linked to the extent and duration of sea ice cover. Furthermore, sea ice serves as a major transportation route for commercial shipping and resource extraction. Given these factors, the study of sea ice in polar regions is essential for understanding and mitigating the impacts of climate change and ensuring the long-term sustainability of polar ecosystems and human societies. 

The primary focus of this assignment is to conduct some basic data analysis of sea ice extent in the polar regions. The analysis will be centered on the study of a given data set (SeaIceExtent.csv). Through this study, the goal is to gain a preliminary comprehension of how sea ice extent is changing in the polar regions over time. The analysis will involve the use of statistical tools and techniques to identify trends, patterns, and anomalies in the data, as well as to develop models, particularly linear regression models, that can help predict future changes in sea ice cover. The findings of this study will be potentially valuable in informing policy decisions related to climate change mitigation and adaptation, as well as in guiding the development of sustainable strategies for managing polar resources and ecosystems. Overall, this assignment represents an important effort to contribute to our understanding of one of the most critical and rapidly changing aspects of the Earth's climate system. With all this in mind, I hope you find joy in engaging with this assignment beyond simply completing it.



*******************
## Package Imports

In [137]:
# Common Imports
import math

import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from matplotlib import cm
import seaborn as sns
plt.style.use('seaborn')
%matplotlib inline

In [138]:
# Import additional modules here as required
#
# Note that only modules in the standard Anaconda distribution are allowed. 
# If you need to install it manually, it is not an accepted package.
#
import statistics
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


**Several notes on printing and plotting**
* Throughout this assignment, when writing code to print, your code should <u>**include the relevant units and/or give a relevant description**</u> of what you are printing. 

* A professional plot typically has the following characteristics:

    * Clear and concise labeling: A professional plot should have clear and concise labeling of the x and y axes, as well as a clear title that conveys the purpose of the plot.

    * Appropriate scales: The scales on the x and y axes should be appropriate for the data being presented. This means that the scales should be chosen so that the data is not too compressed or stretched out, and so that important features of the data are easily visible.

    * Appropriate plot type: The plot type should be appropriate for the data being presented. For example, if the data is continuous, a line plot or a scatter plot may be appropriate. If the data is categorical, a bar chart or a pie chart may be more appropriate.

    * Clarity: A professional plot should be visually clear and easy to interpret. This means that unnecessary elements should be removed, colors should be used judiciously, and the plot should be free of clutter.

    * Consistency: A professional plot should be consistent in its formatting with other plots that may be presented in the same report. This means that font sizes, line widths, and colors should be consistent across all plots.
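
The characteristics listed above can be illustrated with a small helper. This is only a sketch under stated assumptions: the function name `plot_two_series` and the synthetic series are hypothetical and are not part of the assignment data or template.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt


def plot_two_series(x, y_a, y_b, label_a, label_b, title, xlabel, ylabel):
    """Plot two series on shared axes with a title, labelled axes and a legend."""
    fig, ax = plt.subplots(figsize=(10, 4))
    # same line width for both series keeps the formatting consistent
    ax.plot(x, y_a, label=label_a, linewidth=1.5)
    ax.plot(x, y_b, label=label_b, linewidth=1.5)
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)  # the label should state the units
    ax.legend()
    fig.tight_layout()
    return fig, ax


# Synthetic, made-up values purely for illustration (not the assignment data)
days = np.arange(365)
series_a = 10 + 5 * np.sin(2 * np.pi * days / 365)
series_b = 10 - 5 * np.sin(2 * np.pi * days / 365)

fig, ax = plot_two_series(
    days, series_a, series_b,
    "Series A", "Series B",
    "Two synthetic seasonal series",
    "Day of year", "Value (arbitrary units)",
)
```

Reusing one helper for every figure is one way to keep font sizes, line widths and labelling consistent across a report.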


****
## Q1: Loading and Basic Analysis of the Data

Briefly state your work on each Task and justify your decisions. For example, when using an existing function, briefly state what the function does (e.g., a mean function calculates the average values of the ensembles). 


#### Task 1: Load the data file SeaIceExtent.csv into a Pandas DataFrame and make it ready for use. (4 marks)

**Hints:**
* You might want to drop unnecessary or redundant columns and/or rows.

* You might want to rename some columns and/or rows.

In [103]:
# YOUR CODE HERE
#read the data
data = pd.read_csv('SeaIceExtent.csv')
#rename the columns to shorter names
data.rename(columns={'     Extent (Antarctic)' : 'Antarctic_ex',
                    '    Missing (Antarctic)' : 'Antarctic_mis',
                    '     Extent (Arctic)' : 'Arctic_ex',
                    '    Missing (Arctic)' : 'Arctic_mis'}, inplace=True)
#drop the unit part
data = data.drop([0], axis=0)
#add a date column at the end, assembled from the year, month and day columns
#(converting to int also removes stray whitespace left over from the CSV)
data['date'] = pd.to_datetime({'year': data['Year'].astype(int),
                               'month': data[' Month'].astype(int),
                               'day': data[' Day'].astype(int)})
#drop the unnecessary columns
data = data.drop(data.columns[[7,8]], axis=1)
print(data)

#Year month day to YYYY-MM-DD
data_use = pd.read_csv("SeaIceExtent.csv",parse_dates= { 'date': ['Year', ' Month', ' Day'] },skiprows=[1] ,usecols=[0,1,2,3,4,5,6,8])
data_use.rename(columns={'     Extent (Antarctic)' : 'Antarctic_ex',
                    '    Missing (Antarctic)' : 'Antarctic_mis',
                    '     Extent (Arctic)' : 'Arctic_ex',
                    '    Missing (Arctic)' : 'Arctic_mis'}, inplace=True)

print(data_use)

# (ADD ANY ADDITIONAL CELLS AS REQUIRED; Same below)


       Year  Month  Day Antarctic_ex Antarctic_mis Arctic_ex Arctic_mis  \
1      1978     10   26       17.624             0    10.231          0   
2      1978     10   28       17.803             0     10.42          0   
3      1978     10   30        17.67             0    10.557          0   
4      1978     11    1       17.527             0     10.67          0   
5      1978     11    3       17.486             0    10.777          0   
...     ...    ...  ...          ...           ...       ...        ...   
14548  2023      3    2        1.908             0    14.627          0   
14549  2023      3    3        1.877             0     14.61          0   
14550  2023      3    4        1.896             0    14.641          0   
14551  2023      3    5        1.939             0    14.646          0   
14552  2023      3    6        2.034             0    14.579      0.003   

            date  
1     1978-10-26  
2     1978-10-28  
3     1978-10-30  
4     1978-11-01  
5   

#### Task 2: Data basic visualisation 

- Task 2.1: Access the data and print the Antarctic and Arctic sea ice extents of week 2 of S1 2023 (February 27 to March 3, 2023). (2 marks)
    

In [104]:
# YOUR CODE HERE
#copy the data make sure it is not changed
data_task2 = data_use.copy()
data_task2.set_index('date', inplace = True)
start_date = '2023-02-27'
end_date = '2023-03-03'
res = data_task2.loc[start_date : end_date, ['Antarctic_ex','Arctic_ex']]
print(res)




            Antarctic_ex  Arctic_ex
date                               
2023-02-27         1.856     14.462
2023-02-28         1.848     14.476
2023-03-01         1.907     14.529
2023-03-02         1.908     14.627
2023-03-03         1.877     14.610


- Task 2.2: In one figure, plot the daily trend of the Antarctic and Arctic sea ice extents.  (4 marks)
    

In [None]:
# YOUR CODE HERE
plt.figure(figsize=(15, 6))
#plot the data
plt.plot(data_use['date'], data_use['Antarctic_ex'], label='Antarctic')
plt.plot(data_use['date'], data_use['Arctic_ex'], label='Arctic')

plt.title('Daily Trend of Antarctic and Arctic Sea Ice Extents')
plt.xlabel('Year')
plt.ylabel('Extent (million km$^2$)')

plt.legend()

plt.show()




#### Task 3: Data basic processing
    
- Task 3.1: Identify the missing dates (the dates when the sea ice extents are not available): Store the total number of dates of each year in a table where each row specifies the year (column 1) and the total number of dates (column 2). Print the table. (4 marks)
    

In [None]:


# YOUR CODE HERE

date_res = data_use.copy()
date_res.index = pd.to_datetime(date_res['date'])
#get the start date and end date
start_date = date_res.index[0]
end_date = date_res.index[-1]
#find the dates in the full daily range that have no record
date_range = pd.date_range(start_date, end_date, freq='D').difference(date_res.index)
missing_dates = {}

# count the number of missing dates by loop
for i in date_range.year.unique():
    missing_dates[i] = len(date_range[date_range.year == i])

# create a dataframe from the missing_dates dictionary and print the table
missing_dates_table = pd.DataFrame.from_dict(missing_dates, orient='index', columns=['Total Number of Missing Dates'])
print(missing_dates_table)
print("after 1988, the Total Number of missing dates are 0")
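
Task 3.1 asks for the total number of dates in each year. One possible way to build such a table without an explicit loop is sketched below. The observation dates are hypothetical (every second day in 1979, daily in 1980) and the column names are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical observation dates: every second day in 1979, daily in 1980
dates = pd.concat(
    [
        pd.Series(pd.date_range("1979-01-01", "1979-12-31", freq="2D")),
        pd.Series(pd.date_range("1980-01-01", "1980-12-31", freq="D")),
    ],
    ignore_index=True,
)

# Number of dates that have a record in each year
available = dates.groupby(dates.dt.year).size()

# Number of calendar days in each year covered by the record
calendar = pd.Series(pd.date_range(dates.min(), dates.max(), freq="D"))
expected = calendar.groupby(calendar.dt.year).size()

# Align both counts on the same years; a year without any record counts as 0
available = available.reindex(expected.index, fill_value=0)

table = pd.DataFrame(
    {
        "Year": expected.index,
        "Dates available": available.to_numpy(),
        "Dates missing": (expected - available).to_numpy(),
    }
)
print(table.to_string(index=False))
```

Grouping by `dt.year` gives one row per year, so the table also lists years in which no dates are missing.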




- Task 3.2: Evaluate the sea ice extent monthly averages; then similar to Task 2.2, in one figure, plot the monthly trend of the Antarctic and Arctic sea ice extents. (4 marks)

In [None]:
# YOUR CODE HERE
data_get_monthly = data_use.copy()
#add year/month helper columns, then average every numeric column within each calendar month
data_get_monthly['month'] = data_get_monthly['date'].dt.month
data_get_monthly['year'] = data_get_monthly['date'].dt.year
data_get_monthly['day'] = data_get_monthly['date'].dt.day
data_get_monthly['date'] = data_get_monthly['date'].dt.strftime('%Y-%m')
data_get_monthly = data_get_monthly.groupby(['date','month','year']).mean(numeric_only=True)
data_get_monthly = data_get_monthly.reset_index()
print(data_get_monthly)


plt.figure(figsize=(15, 6))
plt.plot(data_get_monthly['date'], data_get_monthly['Antarctic_ex'], label='Antarctic')
plt.plot(data_get_monthly['date'], data_get_monthly['Arctic_ex'], label='Arctic')
plt.title('Monthly Trend of Antarctic and Arctic Sea Ice Extents')
plt.xlabel('Year')
plt.ylabel('Extent (million km$^2$)')
plt.legend()
plt.show()


#### Task 4: Study the central tendency of the data

Recall from Lecture 3 that the most common measures of central tendency are the 3Ms: Mode, Median, and Mean. We study these 3Ms one by one below.

- Task 4.1: Calculate and print the mode of both the Antarctic and Arctic sea ice extents. For this calculation, round both the Antarctic and Arctic sea ice extents to the nearest million. (4 marks)
    

In [None]:
# YOUR CODE HERE
# convert to numbers and round to the nearest million km^2 (the values are in 10^6 km^2)
# BEFORE taking the mode, as the task requires
to_numb = data['Antarctic_ex'].astype(float).round()
to_numA = data['Arctic_ex'].astype(float).round()
antarctic_mode = to_numb.mode()[0]
print("Antarctic sea ice extent mode:", antarctic_mode, '10^6  km^2')

# Calculate the mode of Arctic sea ice extent
arctic_mode = to_numA.mode()[0]
print("Arctic sea ice extent mode:", arctic_mode,'10^6  km^2')

- Task 4.2: Calculate and print the median of both the Antarctic and Arctic sea ice extents (2 marks)
    

In [None]:
# YOUR CODE HERE

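# the columns were read as text, so convert them to numbers before taking the median
ant_med = data['Antarctic_ex'].astype(float).median()
arc_med = data['Arctic_ex'].astype(float).median()
print("The median of the Antarctic sea ice extent is", ant_med, '10^6  km^2')
print("The median of the Arctic sea ice extent is", arc_med, '10^6  km^2')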

- Task 4.3: Calculate and print the mean of both the Antarctic and Arctic sea ice extents (2 marks)
    

In [None]:
# YOUR CODE HERE

# convert the text columns to numbers, then take the arithmetic mean
ant_mean = data['Antarctic_ex'].astype(float).mean()
art_mean = data['Arctic_ex'].astype(float).mean()

print("The mean of the Antarctic sea ice extent is", ant_mean,'10^6  km^2')
print("The mean of the Arctic sea ice extent is", art_mean,'10^6  km^2')



#### Task 5: Study the variability of the data

- Task 5.1: Create two new columns to store the sums of the missing extents and the corresponding polar regions (adding Missing Antarctic to Antarctic; adding Missing Arctic to Arctic). (2 marks) 
    

In [None]:
# YOUR CODE HERE
data['Total_Antarctic'] = data['Antarctic_ex'].astype(float) + data['Antarctic_mis'].astype(float)
data['Total_Arctic'] = data['Arctic_ex'].astype(float) + data['Arctic_mis'].astype(float)

print(data)




- Task 5.2: Using the two new columns, calculate and print the range of both the Antarctic and Arctic sea ice extents (2 marks)
    

In [None]:
# YOUR CODE HERE
#to numeric in order to get the range
ant_num = pd.to_numeric(data['Total_Antarctic'])
arc_num = pd.to_numeric(data['Total_Arctic'])
#get the range by using max and min
range_antarctic_diff = ant_num.max() - ant_num.min()
range_arctic_diff = arc_num.max() - arc_num.min()
print(f"Antarctic range: {range_antarctic_diff:.3f} ",'10^6  km^2')
print(f"Arctic range: {range_arctic_diff:.3f} ",'10^6  km^2')
print("Range of Antarctic sea ice =  Min: " , data['Total_Antarctic'].min() , " Max: " , data['Total_Antarctic'].max(),'10^6  km^2')
print("Range of Arctic sea ice =  Min: " , data['Total_Arctic'].min() , " Max: " , data['Total_Arctic'].max(),'10^6  km^2')



- Task 5.3: Using the two new columns, calculate and print the variance of both the Antarctic and Arctic sea ice extents (2 marks)
    

In [None]:
# YOUR CODE HERE
#get the variance
ant_var = ant_num.var()
arc_var = arc_num.var()
print("The variance of the Antarctic sea ice extent is", ant_var, '(10^6  km^2)^2')
print("The variance of the Arctic sea ice extent is", arc_var, '(10^6  km^2)^2')


- Task 5.4: Using the two new columns, calculate and print the standard deviation of both the Antarctic and Arctic sea ice extents (2 marks)

In [None]:
# YOUR CODE HERE
#get the standard deviation
ant_sd = ant_num.std()
arc_std = arc_num.std()
print("The SD of the Antarctic sea ice extent is", ant_sd, '10^6  km^2')
print("The SD of the Arctic sea ice extent is", arc_std, '10^6  km^2')

******
## Q2: EXPLORATORY DATA ANALYSIS

In this section you are expected to do an exploratory data analysis on the dataset. Herein, we use the data generated in Task 5.1. 

- Task 1: Explore the correlation of the Antarctic and Arctic sea ice extents. You need to use both plot/plots and correlation test. (6 marks)

In [None]:
# YOUR CODE HERE
plt.figure(figsize=(15, 6))
plt.scatter(data['Total_Antarctic'], data['Total_Arctic'], color = 'red', label = 'Antarctic vs Arctic')
plt.xlabel('Antarctic sea ice extent (10^6 sq km)')
plt.ylabel('Arctic sea ice extent (10^6 sq km)')
plt.title('Antarctic vs Arctic sea ice extent')
plt.legend()
plt.show()
corr = data['Total_Antarctic'].corr(data['Total_Arctic'])
print("The correlation for this chart is",corr)
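
Task 1 asks for a correlation test as well as a plot. A correlation coefficient alone does not report significance, so a test such as `scipy.stats.pearsonr`, which returns both the coefficient and a p-value, may be used. The sketch below runs it on synthetic data; the arrays `x` and `y` are made up for illustration and are not the sea ice columns.

```python
import numpy as np
from scipy import stats

# Synthetic, negatively related samples (illustration only)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = -0.8 * x + rng.normal(scale=0.5, size=200)

# pearsonr returns the correlation coefficient and the two-sided p-value
# for the null hypothesis that the two variables are uncorrelated
r, p_value = stats.pearsonr(x, y)
print(f"Pearson r = {r:.3f}, p-value = {p_value:.3g}")
```

With the assignment data, the two inputs would be the two extent columns after converting them to numbers and removing rows with missing values.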

- Task 2: Shift the Arctic sea ice extent **forward** by six months (for example, this shift moves the date 15/01/2000 to 15/07/2000) and process your data appropriately so that the dates match (4 marks)

    **Hints**: (a) basically you want to remove the first six months of the Antarctic sea ice extent and the last six months so that the time window matches; (b) there are dates where either Antarctic sea ice extent or Arctic sea ice extent is missing; remove these rows to match. 

In [None]:
# YOUR CODE HERE
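#take a copy so that shifting the dates does not modify (or warn about) the original frame
data_arc = data[['date', 'Total_Arctic']].copy()
#shift the date by 6 months
data_arc["date"] = data_arc["date"] + pd.DateOffset(months=6)
data_ant = data[['date', 'Total_Antarctic']]
#an inner merge keeps only the dates present in both series
data_merge = pd.merge(data_arc, data_ant, on='date', how='inner')
#drop rows that contain null values
data_merge = data_merge.dropna()
print(data_merge)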



- Task 3: Explore the correlation of the new data set. (4 marks) 
    

In [None]:
# YOUR CODE HERE
plt.figure(figsize=(15, 6))
plt.scatter(data_merge['Total_Antarctic'], data_merge['Total_Arctic'], color = 'red', label = 'Antarctic vs Arctic after shifting')
plt.xlabel('Antarctic sea ice extent (10^6 sq km)')
plt.ylabel('Arctic sea ice extent (10^6 sq km)')
plt.title('Antarctic vs Arctic sea ice extent after shifting')
plt.legend()
plt.show()
correlation= data_merge['Total_Antarctic'].corr(data_merge['Total_Arctic'])
print("The correlation is " + str(correlation))

- Task 4: Compare the results of Task 3 with Task 1 and discuss the comparison. (4 marks)
    

In [None]:
# YOUR CODE HERE
plt.figure(figsize=(15, 6))
plt.scatter(data['Total_Antarctic'], data['Total_Arctic'], color = 'red', label = 'Antarctic vs Arctic')
plt.xlabel('Antarctic sea ice extent (10^6 sq km)')
plt.ylabel('Arctic sea ice extent (10^6 sq km)')
plt.title('Antarctic vs Arctic sea ice extent')
plt.legend()
plt.figure(figsize=(15, 6))
plt.scatter(data_merge['Total_Antarctic'], data_merge['Total_Arctic'], color = 'blue', label = 'Antarctic vs Arctic after shifting')
plt.xlabel('Antarctic sea ice extent (10^6 sq km)')
plt.ylabel('Arctic sea ice extent (10^6 sq km)')
plt.title('Antarctic vs Arctic sea ice extent after shifting')
plt.legend()
plt.show()


- Task 5: Further exploration (12 marks)
   
    * Perform a boxplot analysis of both the Antarctic and Arctic sea ice extents, describe the boxplots, and discuss the comparison of the two. (6 marks)
    
    * Perform both one-sample (for Antarctic sea ice extent with known value 1.0x10^7 square kilometers) and two-sample T-tests. (6 marks) **Hints** (a) Lecture note 3 may be helpful (for example, see page 49 regarding the definition); (b) Report some descriptive statistics on your data and explain what they mean in the context of this dataset.
    

In [None]:
# YOUR CODE HERE
plt.boxplot([data['Total_Antarctic'],data['Total_Arctic']])
plt.xticks([1,2],['Antarctic','Arctic'])
plt.xlabel('Region')
plt.ylabel('Result of Sea ice extent (10^6 sq km)')

plt.show()



In [None]:
print(np.var(data['Total_Arctic'].astype(float)), np.var(data['Total_Antarctic'].astype(float)))
print(stats.ttest_ind(data['Total_Arctic'].astype(float), data['Total_Antarctic'].astype(float), equal_var=False))

Here we use a one-sample t-test to compare the mean of a sample with a hypothesised value.
H0 (the null hypothesis): the mean of the sample is equal to the hypothesised value.
H1 (the alternative hypothesis): the mean of the sample is not equal to the hypothesised value.
The hypothesised value is 1.0*10^7 km^2, so H0 states that the mean Antarctic sea ice extent equals 1.0*10^7 km^2 and H1 states that it does not.

We use the p-value to decide whether to reject the null hypothesis: a p-value below the chosen significance level (commonly 0.05) is evidence against H0. In this case the reported p-value is 0.000, so we reject the null hypothesis and conclude that the mean of the sample is not equal to the hypothesised value.
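
A one-sample t-test of this kind can be run with `scipy.stats.ttest_1samp`. The sketch below uses made-up values, not the assignment data, and assumes the sample is expressed in 10^6 km^2, so the hypothesised value of 1.0*10^7 km^2 becomes 10.0. The names `sample` and `hypothesised_mean` are hypothetical.

```python
import numpy as np
from scipy import stats

# Made-up sample in units of 10^6 km^2 (illustration only)
rng = np.random.default_rng(1)
sample = rng.normal(loc=11.5, scale=5.5, size=500)

# 1.0 x 10^7 km^2 expressed in the same units as the sample (10^6 km^2)
hypothesised_mean = 10.0

# Two-sided one-sample t-test of H0: population mean == hypothesised_mean
t_stat, p_value = stats.ttest_1samp(sample, popmean=hypothesised_mean)

# The same statistic from its definition: (sample mean - mu0) / (s / sqrt(n))
n = sample.size
t_manual = (sample.mean() - hypothesised_mean) / (sample.std(ddof=1) / np.sqrt(n))

alpha = 0.05
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.3f}, p-value = {p_value:.3g}, decision at alpha={alpha}: {decision}")
```

Keeping the sample and the hypothesised value in the same units matters: comparing values stored in 10^6 km^2 against 1.0e7 directly would test a different hypothesis.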

******
## Q3: Linear Regression

The exciting stage is here! Your task is to create several Regression Models in this section. The goal is to build useful models (keep in mind that "All models are wrong, but some are useful", a quote by George Box) for forecasting the evolution of the sea ice extent. For this purpose and for simplicity, we focus on Antarctic region (only use data from the year 1980 to 2022 for the Antarctic sea ice except for Task 5 below). Herein, we use the data generated in Task 5.1. Specifically, the tasks are below.


- Task 1: Use the data from the year 1980 to 2021 and build a linear regression model for the Antarctic sea ice extent. Specifically, 

    - The input (x-axis) of the model should be time (the j-th day since the first day, 01/01/1980) and the output (y-axis) is the predicted Antarctic sea ice extent for a given time/date. (4 marks)
    
    - Visualise the model: in a scatter plot of the data from 1980 to 2021, draw the linear regression modeled line. (4 marks)
    
    - Apply the model to predict the sea ice extent for the year 2022. (2 marks)
    
    - Visualise and plot the comparison of the predicted values of 2022 with the actual values from the data. (4 marks)
    
    - Calculate the root mean squared error and R-squared value of the predicted values of 2022. (2 marks)
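
The five steps above can be sketched end to end on synthetic data. This is a minimal illustration under stated assumptions, not the expected solution: the series is made up (a slow linear decline plus a yearly cycle), and the column names `dayIndex` and `extent` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic daily series (made-up numbers): slow linear decline plus a yearly cycle
dates = pd.date_range("1980-01-01", "2022-12-31", freq="D")
day_index = (dates - pd.Timestamp("1980-01-01")).days.to_numpy()
extent = 12.0 - 1e-5 * day_index + 5.0 * np.sin(2 * np.pi * day_index / 365.25)
df = pd.DataFrame({"date": dates, "dayIndex": day_index, "extent": extent})

# Fit on 1980-2021 only; 2022 is held out for prediction
train = df[df["date"].dt.year <= 2021]
test = df[df["date"].dt.year == 2022]

model = LinearRegression().fit(train[["dayIndex"]], train["extent"])

# Apply the fitted model to the held-out year (same time origin as in training)
pred_2022 = model.predict(test[["dayIndex"]])

# Root mean squared error and R-squared of the 2022 predictions
rmse = np.sqrt(mean_squared_error(test["extent"], pred_2022))
r2 = r2_score(test["extent"], pred_2022)
print(f"RMSE: {rmse:.3f} (same units as the extent), R-squared: {r2:.3f}")
```

Because a single straight line cannot follow a seasonal cycle, the R-squared of such a model on one held-out year can be close to zero or even negative.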

In [None]:
# YOUR CODE HERE
#get the data from 1980 to 2021
data['date'] = pd.to_datetime(data['date'])
start = pd.to_datetime('1980-01-01')
end = pd.to_datetime('2021-12-31')
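date = (data['date'] >= start) & (data['date'] <= end)
#copy the selection so that adding a column does not touch (or warn about) the original frame
res_date = data.loc[date].copy()
res_date['dayIndex'] = (res_date['date'] - pd.to_datetime('1980-01-01')).dt.days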
print(res_date)

df_lr = res_date
train, test = train_test_split(df_lr, test_size=0.2)
x_train = train[['dayIndex']]
y_train = train[['Total_Antarctic']]

x_test = test[['dayIndex']]
y_test = test[['Total_Antarctic']]

# Create an instance of the LinearRegression class and fit the model using the training data
lr = LinearRegression()
model = lr.fit(x_train, y_train)

# Evaluate the model performance using the R-squared score
train_score = model.score(x_train, y_train)
test_score = model.score(x_test, y_test)
print("Train Score:", train_score)
print("Test Score: ", test_score)
y_pred = model.predict(x_test)


plt.figure(figsize=(10, 6))
plt.scatter(x_test['dayIndex'], y_test, color='blue', label='Actual')
plt.plot(x_test['dayIndex'], y_pred, color='red', label = 'model_predict')
plt.xlabel('Days since 1980')
plt.ylabel('Antarctic sea ice extent (10^6 sq km)')
plt.title('Linear Regression Model')
plt.legend()
plt.show()




In [None]:

#YOUR CODE HERE
data['date'] = pd.to_datetime(data['date'])
#select the 2022 observations
start = pd.to_datetime('2022-01-01')
end = pd.to_datetime('2022-12-31')
date = (data['date'] >= start) & (data['date'] <= end)
result_date = data.loc[date].copy()
#use the same time origin as the training data (days since 1980-01-01)
result_date['dayIndex'] = (result_date['date'] - pd.to_datetime('1980-01-01')).dt.days

#apply the model fitted on the 1980-2021 data in the previous cell; do not refit on 2022
x_2022 = result_date[['dayIndex']]
y_2022 = result_date['Total_Antarctic'].astype(float)
predict = np.ravel(model.predict(x_2022))

#compare the predicted values of 2022 with the actual values
plt.figure(figsize=(10, 6))
plt.scatter(x_2022['dayIndex'], y_2022, color='blue', label='Actual')
plt.plot(x_2022['dayIndex'], predict, color='red', label = 'model_predict')
plt.xlabel('Days since 1 January 1980')
plt.ylabel('Antarctic sea ice extent (10^6 sq km)')
plt.title('Predicted vs actual Antarctic sea ice extent in 2022')
plt.legend()
plt.show()

# Evaluate the 2022 predictions
root_mean = np.sqrt(mean_squared_error(y_2022, predict))
R_squared = model.score(x_2022, y_2022)
print('Root mean squared error: ', root_mean, '10^6 sq km')
print('R-squared value: ', R_squared)


- Task 2: Perform the same study as above, but build four linear regression models for each year, one for each season (in Australia, December to February is summer; March to May is autumn; June to August is winter; and September to November is spring. **There are only four seasons**). (8 marks: the marks in each of the five parts in Task 1 above are halved) 

    **Requirements and hints are**

    - Your model shall consist of line segments. Specifically, there should be four lines in each year (build a linear regression model for each season). Thus, in total, there should be 42*4=168 line segments. 
    
    - For better visualisation of your model, you may plot each of these line segments with one color for one season. 
    
    - In order to make a prediction about the extent of sea ice in 2022, we must use 168 line segments from the years 1980 to 2021 to create four line segments for 2022. To achieve this, we must determine the slopes and y-intercepts of these four line segments for 2022 using linear regression analysis. For example, we can use the 42 slopes of the lines of Springs over 1980 to 2021 to create a linear regression model to predict the slope of the line for Spring 2022.
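
The two-level idea in the last hint can be sketched as follows. This is a minimal illustration on synthetic data, not the expected solution: the column names `extent`, `season` and `season_year`, and the choice to count December towards the following year's summer, are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic daily series covering 42 "season years" (made-up values)
dates = pd.date_range("1979-12-01", "2021-11-30", freq="D")
t = (dates - dates[0]).days.to_numpy()
df = pd.DataFrame({"date": dates, "extent": 12.0 + 6.0 * np.sin(2 * np.pi * t / 365.25)})

# Australian seasons; December is counted with the following year's summer (assumption)
season_of_month = {12: "summer", 1: "summer", 2: "summer",
                   3: "autumn", 4: "autumn", 5: "autumn",
                   6: "winter", 7: "winter", 8: "winter",
                   9: "spring", 10: "spring", 11: "spring"}
df["season"] = df["date"].dt.month.map(season_of_month)
df["season_year"] = df["date"].dt.year + (df["date"].dt.month == 12).astype(int)

# Level 1: one straight line per (year, season), with x = days since the season start
rows = []
for (year, season), grp in df.groupby(["season_year", "season"]):
    x = (grp["date"] - grp["date"].min()).dt.days.to_numpy()
    slope, intercept = np.polyfit(x, grp["extent"].to_numpy(), deg=1)
    rows.append({"year": year, "season": season, "slope": slope, "intercept": intercept})
segments = pd.DataFrame(rows)

# Level 2: for each season, regress the 42 slopes (and intercepts) on the year
# and extrapolate both to 2022
pred_2022 = {}
x_new = pd.DataFrame({"year": [2022]})
for season, grp in segments.groupby("season"):
    slope_model = LinearRegression().fit(grp[["year"]], grp["slope"])
    intercept_model = LinearRegression().fit(grp[["year"]], grp["intercept"])
    pred_2022[season] = (slope_model.predict(x_new)[0], intercept_model.predict(x_new)[0])

for season, (slope, intercept) in pred_2022.items():
    print(f"{season}: predicted slope {slope:.5f} per day, intercept {intercept:.3f}")
```

Storing every fitted slope and intercept in one table makes the second-level regression a simple group-by, and the same table can be reused to draw the 168 segments with one colour per season.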

In [None]:
from time import strftime

# YOUR CODE HERE

def from_begin(df, begin_day):
    """
    Add a column ["dayIndex"] to the pandas dataframe that
    counts the number of days from the begin_day
    """
    start_date_col = pd.to_datetime([begin_day] * df.shape[0], format='%Y/%m/%d')
    df['dayIndex'] = (df['date'] - start_date_col).dt.days.astype(float)
    return df.reset_index(drop=True)

min_date = "1979-12"
max_date = "2021-12"
months = pd.date_range(start=min_date, end=max_date, freq='MS')
months = months.strftime("%Y-%m").tolist()

summer = []
autumn = []
winter = []
spring = []




for i in range(0, len(months)-3, +3):
    start_rec = months[i]
    end_rec = months[i+2]
    #keep the three months of this season; the first day of the next season is excluded
    resu_data = data[data['date'] >= months[i]]
    resu_data = resu_data[resu_data['date'] < months[i+3]]
    resu_data = from_begin(resu_data, resu_data['date'].min())
    #print(resu_data)

    lr = LinearRegression()
    x_train = resu_data['dayIndex']
    y_train = resu_data['Total_Antarctic']

    model = lr.fit(np.array(x_train).reshape(-1, 1), y_train)

    plt.figure(figsize=(15, 6))
    plt.scatter(x_train, y_train, color = "blue", label = 'data_range_actual')

    b0 = model.intercept_
    b1 = model.coef_[0]
    x_pred = [x_train.min(), x_train.max()]                      # get the bounds for x
    y_pred = model.predict(np.array(x_pred).reshape(-1, 1))  # predict the y values
    # months[] starts in December, so every block of 12 months runs summer, autumn, winter, spring
    season = ['summer', 'autumn', 'winter', 'spring'][(i % 12) // 3]
    plt.plot(x_pred, y_pred, color='red', label = 'model_predict')
    plt.xlabel('dayIndex in ' + season)
    plt.ylabel('Antarctic sea ice extent (10^6 sq km) in ' + season)
    plt.title('Antarctic sea ice extent against day index with predicted results in ' + season + ' (season starting ' + months[i] + ')')
    plt.legend()
    plt.show()

















In [None]:
def from_begin(df, begin_day):
    """
    Add a column ["dayIndex"] to the pandas dataframe that
    counts the number of days from the begin_day
    """
    start_date_col = pd.to_datetime([begin_day] * df.shape[0], format='%Y/%m/%d')
    df['dayIndex'] = (df['date'] - start_date_col).dt.days.astype(float)
    return df.reset_index(drop=True)

min_date = "2021-12"
max_date = "2022-12"
months = pd.date_range(start=min_date, end=max_date, freq='MS')
months = months.strftime("%Y-%m").tolist()

summer = []
autumn = []
winter = []
spring = []



for i in range(0, len(months)-3, +3):
    start_rec = months[i]
    end_rec = months[i+2]
    #keep the three months of this season; the first day of the next season is excluded
    resu_data = data[data['date'] >= months[i]]
    resu_data = resu_data[resu_data['date'] < months[i+3]]
    resu_data = from_begin(resu_data, resu_data['date'].min())
    #print(resu_data)

    lr = LinearRegression()
    x_train = resu_data['dayIndex']
    y_train = resu_data['Total_Antarctic']

    model = lr.fit(np.array(x_train).reshape(-1, 1), y_train)

    plt.figure(figsize=(15, 6))
    plt.scatter(x_train, y_train, color = "blue", label = 'data_range_actual')

    b0 = model.intercept_
    b1 = model.coef_[0]
    x_pred = [x_train.min(), x_train.max()]                      # get the bounds for x
    y_pred = model.predict(np.array(x_pred).reshape(-1, 1))  # predict the y values
    plt.plot(x_pred, y_pred, color='red', label = 'model_predict')
    #get season
    if i == 0:
        season = 'summer'
    elif i == 3:
        season = 'autumn'
    elif i == 6:
        season = 'winter'
    else:
        season = 'spring'
        #label in season


    plt.xlabel('dayIndex in ' + season)
    plt.ylabel('Antarctic sea ice extent (10^6 sq km) in ' + season)
    plt.title('Antarctic sea ice extent against day index with predicted results in ' + season)
    plt.legend()
    plt.show()

    # R-squared and root mean squared error of this season's fitted line on this season's data
    x_season = np.array(resu_data['dayIndex']).reshape(-1, 1)
    R_squared = model.score(x_season, resu_data['Total_Antarctic'])
    predict = model.predict(x_season)
    root_mean = np.sqrt(mean_squared_error(resu_data['Total_Antarctic'], predict))
    print('Root mean squared error: ', root_mean)
    print('R-squared value: ', R_squared)



- Task 3: Perform the same study as above (taking both Tasks 1 and 2 into consideration), but build 12 linear regression models for each year, one for each month (8 marks: the marks in each of the five parts in Task 1 above are halved.)

In [None]:
# YOUR CODE HERE
def from_begin(df, begin_day):
    """
    Return a copy of the pandas dataframe with a column ["dayIndex"] that
    counts the number of days from the begin_day
    """
    df = df.copy()  # work on a copy so a slice of the original dataframe is not modified
    df['dayIndex'] = (df['date'] - pd.to_datetime(begin_day)).dt.days.astype(float)
    return df.reset_index(drop=True)

min_date = "1980-01"
max_date = "2021-12"
months = pd.date_range(start=min_date, end=max_date, freq='MS')
months = months.strftime("%Y-%m").tolist()

Jan = []
Feb = []
Mar = []
Apr = []
May = []
Jun = []
Jul = []
Aug = []
Sep = []
Oct = []
Nov = []
Dec = []


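for i in range(len(months)):
    month_start = pd.Timestamp(months[i])
    # compute the upper bound directly so the last month does not index past the end of months
    next_month_start = month_start + pd.offsets.MonthBegin(1)
    resu_data = data[(data['date'] >= month_start) & (data['date'] < next_month_start)]
    resu_data = from_begin(resu_data, resu_data['date'].min())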
    #print(resu_data)

    lr = LinearRegression()
    x_train = resu_data['dayIndex']
    y_train = resu_data['Total_Antarctic']

    model = lr.fit(np.array(x_train).reshape(-1, 1), y_train)

    plt.figure(figsize=(15, 6))
    plt.scatter(x_train, y_train, color = "blue", label = 'data_range_actual')

    b0 = model.intercept_
    b1 = model.coef_[0]
    x_pred = [x_train.min(), x_train.max()]                      # get the bounds for x
    y_pred = model.predict(np.array(x_pred).reshape(-1, 1))  # predict the y values
    plt.plot(x_pred, y_pred, color='red', label = 'model_predict')
    plt.xlabel('dayIndex in ' + months[i])
    plt.ylabel('Antarctic sea ice extent (10^6 sq km) in ' + months[i])
    plt.title('Antarctic sea ice extent by day index with predicted results in ' + months[i])
    plt.legend()
    plt.show()





In [None]:
def from_begin(df, begin_day):
    """
    Return a copy of the pandas dataframe with a column ["dayIndex"] that
    counts the number of days from the begin_day
    """
    df = df.copy()  # work on a copy so a slice of the original dataframe is not modified
    df['dayIndex'] = (df['date'] - pd.to_datetime(begin_day)).dt.days.astype(float)
    return df.reset_index(drop=True)

min_date = "2022-01"
max_date = "2023-01"
months = pd.date_range(start=min_date, end=max_date, freq='MS')
months = months.strftime("%Y-%m").tolist()

Jan = []
Feb = []
Mar = []
Apr = []
May = []
Jun = []
Jul = []
Aug = []
Sep = []
Oct = []
Nov = []
Dec = []




# months holds 13 entries (2022-01 to 2023-01); stop one short so months[i+1] always exists
for i in range(len(months) - 1):
    resu_data = data[data['date'] >= months[i]]
    resu_data = resu_data[resu_data['date'] < months[i+1]]  # exclude the first day of the next month
    resu_data = from_begin(resu_data, resu_data['date'].min())
    #print(resu_data)

    lr = LinearRegression()
    x_train = resu_data['dayIndex']
    y_train = resu_data['Total_Antarctic']

    model = lr.fit(np.array(x_train).reshape(-1, 1), y_train)

    plt.figure(figsize=(15, 6))
    plt.scatter(x_train, y_train, color = "blue", label = 'data_range_actual')

    b0 = model.intercept_
    b1 = model.coef_[0]
    x_pred = [x_train.min(), x_train.max()]                      # get the bounds for x
    y_pred = model.predict(np.array(x_pred).reshape(-1, 1))  # predict the y values
    plt.plot(x_pred, y_pred, color='red', label = 'model_predict')
    # get the month name (the variable keeps the name `season` from the seasonal version)
    month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                   'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    season = month_names[i]


    plt.xlabel('Days in a month in ' + season)
    plt.ylabel('Antarctic sea ice extent (10^6 sq km) in ' + season)
    plt.title('Antarctic sea ice extent by day index with predicted results in ' + season)
    plt.legend()
    plt.show()

    R_squared = model.score(np.array(resu_data['dayIndex']).reshape(-1, 1), resu_data['Total_Antarctic'])
    # find root mean squared error
    predict = model.predict(np.array(resu_data['dayIndex']).reshape(-1, 1))
    root_mean = np.sqrt(mean_squared_error(resu_data['Total_Antarctic'], predict))
    print('Root mean squared error: ', root_mean)
    print('R-squared value: ', R_squared)

- Task 4: Compare and discuss the performance of models developed in Task 1, 2, and 3. (4 marks)
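
A comparison is easier to discuss when every model is scored with the same metric functions. The sketch below is a rough illustration only: `rmse`, `r_squared` and `piecewise_linear_fit` are hypothetical helper names, the data is a synthetic sine curve rather than the assignment dataset, and `np.polyfit` is used in place of `LinearRegression`. It scores one line, four "seasonal" lines and twelve "monthly" lines on the same series.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def piecewise_linear_fit(x, y, n_segments):
    """Fit an independent line in each of n_segments consecutive slices; return fitted values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    y_hat = np.empty_like(y)
    for idx in np.array_split(np.arange(len(x)), n_segments):
        slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
        y_hat[idx] = slope * x[idx] + intercept
    return y_hat


# Synthetic stand-in for one year of daily values (NOT the assignment dataset)
days = np.arange(360, dtype=float)
extent = 10 + 8 * np.sin(2 * np.pi * days / 360)

scores = {}
for label, n_segments in [("one line", 1), ("4 seasonal lines", 4), ("12 monthly lines", 12)]:
    fitted = piecewise_linear_fit(days, extent, n_segments)
    scores[label] = {"rmse": rmse(extent, fitted), "r2": r_squared(extent, fitted)}

for label, s in scores.items():
    print(f"{label:>18}: RMSE={s['rmse']:.3f}  R^2={s['r2']:.3f}")
```

Because the monthly slices subdivide the seasonal ones, the training error can only go down as the number of segments grows. A fair discussion would therefore also look at errors on data the models were not fitted on.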

In [None]:
# YOUR CODE HERE



- Task 5 (for bonus marks): Be creative and develop your own models using the data until 2021 and apply it to predict the sea ice extent of the year of 2022. The goal is to develop a model which gives smaller root mean squared errors than those from the models developed above. You can use Arctic data here if it helps (only for this Task in Q3). (5 bonus marks)
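
One possible direction, shown purely as an illustration and not as the expected answer, is a harmonic regression: a linear trend plus sine and cosine terms with an annual period, fitted by ordinary least squares. Everything named here (`harmonic_design`, `fit_harmonic`, `predict_harmonic`, the assumed period of 365.25 days) is hypothetical, and the data is synthetic rather than the assignment dataset.

```python
import numpy as np

PERIOD_DAYS = 365.25  # assumed length of the annual cycle


def harmonic_design(t, n_harmonics=2):
    """Design matrix with columns 1, t, and a sin/cos pair per harmonic."""
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t), t]
    for k in range(1, n_harmonics + 1):
        angle = 2 * np.pi * k * t / PERIOD_DAYS
        cols += [np.sin(angle), np.cos(angle)]
    return np.column_stack(cols)


def fit_harmonic(t, y, n_harmonics=2):
    """Ordinary least-squares coefficients for the harmonic design matrix."""
    coef, *_ = np.linalg.lstsq(harmonic_design(t, n_harmonics),
                               np.asarray(y, dtype=float), rcond=None)
    return coef


def predict_harmonic(coef, t, n_harmonics=2):
    return harmonic_design(t, n_harmonics) @ coef


# Synthetic stand-in: five years of noisy seasonal data (NOT the assignment dataset)
rng = np.random.default_rng(0)
t_all = np.arange(5 * 365, dtype=float)
y_all = 10 + 8 * np.sin(2 * np.pi * t_all / PERIOD_DAYS) + rng.normal(0, 0.2, t_all.size)

# Hold out the final year, mirroring "train until 2021, predict 2022"
t_train, y_train = t_all[:-365], y_all[:-365]
t_test, y_test = t_all[-365:], y_all[-365:]

coef = fit_harmonic(t_train, y_train)
harmonic_rmse = float(np.sqrt(np.mean((y_test - predict_harmonic(coef, t_test)) ** 2)))

# Baseline: a single straight line fitted to the same training data
slope, intercept = np.polyfit(t_train, y_train, deg=1)
linear_rmse = float(np.sqrt(np.mean((y_test - (slope * t_test + intercept)) ** 2)))

print("harmonic RMSE:", harmonic_rmse)
print("linear RMSE:  ", linear_rmse)
```

The model stays linear in its coefficients, so it can still be fitted with the same least-squares machinery as the earlier tasks; only the input features change.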

In [130]:
# YOUR CODE HERE
