<a href="https://colab.research.google.com/github/Di-anaBF/Cropland-Mapping/blob/main/KenyaCrop_Type_Process_Labels.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Process Labels

**Script Author**: Ivan Zvonkov
**Country-Specific Update (Ghana)**: Diana B. Frimpong

**Description**: Cleans the reference labels collected in Collect Earth Online (CEO).

Important: for efficiency, this is not the same CEO set that we created in module 2. This set has more points, so only a portion of it will be used for accuracy assessment.

**Prerequisite**:
1. Download the "sample-data" csv files from [here](https://drive.google.com/drive/folders/1opPnDtjGwD8WfF1YXt5AJMtQMGYKiZTO?usp=drive_link)
2. Open the Files tab on the left.
3. Upload the "sample-data" csv files from your computer to Colab by dragging them into the Files area on the left.


## 1. Load reference label sets

In [None]:
import pandas as pd

In [None]:
# Load in csv files from Collect Earth Online sets
df1 = pd.read_csv("/content/ceo-Ghana---Stratified-Sample-2019-(Set-1)-sample-data-2024-03-13.csv")
df2 = pd.read_csv("/content/ceo-Ghana---Stratified-Sample-2019-(Set-2)-sample-data-2024-03-13.csv")

In [None]:
len(df1), len(df2)

(463, 463)

In [None]:
# TASK 1: Add code to analyze contents of a single row from data frame 1
#########################
# Your code below


#########################

In [None]:
# Plot the points
import plotly.express as px

fig = px.scatter_mapbox(df1, lat="lat", lon="lon", hover_name="plotid", zoom=6)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})

## 2. Clean Reference Data

In [None]:
# Stack both dataframes on top of one another
# so they can be analyzed together
all_dfs = pd.concat([df1, df2])
len(all_dfs)

926

In [None]:
from datetime import date

all_dfs["start_date"] = date(2019, 1, 1)
all_dfs["end_date"] = date(2020, 1, 1)

In [None]:
# TASK 2: Are there any rows with no label? If yes, how many?
#########################
# Your code below



#########################

In [None]:
# Remove rows where no label has been added
all_dfs_clean = all_dfs[~all_dfs["Does this pixel contain an active cropland"].isna()].copy()

In [None]:
# Convert the label to a number for easier processing
all_dfs_clean["is_crop"] = (all_dfs_clean["Does this pixel contain an active cropland"] == "crop").astype(int)

In [None]:
all_dfs_clean["is_crop"].value_counts()

0    776
1    150
Name: is_crop, dtype: int64
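
The label conversion above uses an exact string comparison, so any answer that is not exactly `"crop"` is counted as non-crop. The sketch below illustrates this on a toy series. Only the value `"crop"` comes from this notebook; the other answer strings and the normalization step are assumptions made for illustration, not part of the original workflow.

```python
import pandas as pd

# Hypothetical answer strings; only "crop" is known from the notebook.
answers = pd.Series(["crop", "non-crop", "Crop", " crop"])

# Exact comparison: anything that is not exactly "crop" becomes 0.
strict = (answers == "crop").astype(int)

# Assumed normalization: trim whitespace and lowercase before comparing.
normalized = (answers.str.strip().str.lower() == "crop").astype(int)

print(strict.tolist())
print(normalized.tolist())
```

If the two results differ on the real data, it may be worth checking the unique values of the label column before converting it.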

## 3. Process Labeler Agreement

In [None]:
# Create column for keeping track of labelers
all_dfs_clean["num_labeler"] = 1

In [None]:
# Combine all rows which have the same latitude and longitude
df = all_dfs_clean.groupby(
    ["lon", "lat", "start_date", "end_date"],
    as_index=False,
    sort=False
).agg({"is_crop": "mean", "num_labeler": "sum"})

In [None]:
# Analyze distribution of labels
df[["is_crop", "num_labeler"]].value_counts()

is_crop  num_labeler
0.0      2              345
0.5      2               86
1.0      2               32
dtype: int64

In [None]:
# Remove all points where the labelers are evenly split (is_crop == 0.5)
df = df[df["is_crop"] != 0.5].copy().reset_index(drop=True)  # drop=True avoids a stray "index" column in the saved CSVs

In [None]:
# Round the mean vote to the majority label, e.g. 0.3333 becomes 0 and 0.6667 becomes 1
df["is_crop"] = (df["is_crop"] > 0.5).astype(int)
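
The agreement steps in this section can be traced on a small toy table. This is a minimal sketch with made-up coordinates and votes (all values are hypothetical), using the same group, average, drop ties, and threshold sequence as the cells above.

```python
import pandas as pd

# Hypothetical toy labels: three points, each labeled by two labelers.
toy = pd.DataFrame({
    "lon": [-1.0, -1.0, -2.0, -2.0, -3.0, -3.0],
    "lat": [5.0, 5.0, 6.0, 6.0, 7.0, 7.0],
    "is_crop": [0, 0, 0, 1, 1, 1],
    "num_labeler": [1, 1, 1, 1, 1, 1],
})

# One row per location: mean vote and number of labelers.
agreement = toy.groupby(["lon", "lat"], as_index=False, sort=False).agg(
    {"is_crop": "mean", "num_labeler": "sum"}
)

# Drop locations where the vote is tied, then take the majority label.
resolved = agreement[agreement["is_crop"] != 0.5].copy()
resolved["is_crop"] = (resolved["is_crop"] > 0.5).astype(int)

print(agreement)
print(resolved)
```

With two labelers per point the mean can only be 0, 0.5, or 1, so the threshold changes nothing after the ties are removed. It matters once a point has three or more labelers.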

In [None]:
# TASK 3: How many crop points are there in the cleaned label set?
#########################
# Your code below



#########################

## 4. Split into training and test sets

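Since this set is larger than the one from module 2, only a portion of it (120 points) is held out as the test set for accuracy assessment; the remaining points are used for training.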

In [None]:
# Split into train and test sets
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=120, random_state=0)


In [None]:
len(train_df), len(test_df)

(257, 120)
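
As a small illustration of the split above, the sketch below runs `train_test_split` on a toy table (the table and its column names are hypothetical). It shows two properties the notebook relies on: an integer `test_size` is an absolute number of rows, and a fixed `random_state` makes the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy table standing in for the cleaned label set.
toy = pd.DataFrame({"point_id": range(10), "is_crop": [0, 1] * 5})

# An integer test_size is an absolute row count, not a fraction.
toy_train, toy_test = train_test_split(toy, test_size=3, random_state=0)

# Repeating the call with the same random_state gives the same split.
_, toy_test_again = train_test_split(toy, test_size=3, random_state=0)

# No row should appear in both sets.
overlap = set(toy_train.index) & set(toy_test.index)

print(len(toy_train), len(toy_test))
print(toy_test.index.equals(toy_test_again.index))
print(len(overlap))
```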

In [None]:
train_df.to_csv("train-ghananew.csv", index=False)
test_df.to_csv("test-ghananew.csv", index=False)