In [1]:
import pandas as pd
import numpy as np 
import seaborn as sns

# Basic Data Processes

This notebook covers two datasets, one for the Karoi area and one for the Ndambe area. The data was first converted from xlsx format to CSV and then split into individual sheets (karoi_A, karoi_B, karoi_C, karoi_D, and karoi_E), one per area of investigation. This keeps the statistical metrics uniform and helps ensure that numerical data is not skewed by an unrelated variable. Each CSV file is loaded into its own variable, so the data can be accessed by variable name rather than by file path.

To cross-examine variables, sheets can be re-joined via their variable names and recalculated. All NaN values were also filled with 0, using the .fillna(0) method, to avoid a skew from a non-numeric null value.

Imported libraries include Pandas, NumPy, and Seaborn for statistical and modeling purposes. 



In [2]:
pip install Pillow

Note: you may need to restart the kernel to use updated packages.


In [3]:
from PIL import Image

In [4]:
data_karoiA = pd.read_csv("/Users/aliably/Desktop/Karoi_A.csv")

#personal information, karoi_A

FileNotFoundError: [Errno 2] No such file or directory: '/Users/aliably/Desktop/Karoi_A.csv'

In [None]:
data_karoiA

In [None]:
data_karoiA.shape

In [None]:
data_karoiA.count()
#shows number of replies in each column

In [None]:
data_karoiA.dtypes

In [None]:
data_karoiA.apply('nunique')

# Missingno (MSNO)

Missingno is a toolset of data visualizations that show the maximum and minimum nullity in a dataset. In other words, it shows where data is most sparse and where most null values will likely be found. This provides direction for future investigation as to where more data may be required. A peak in the line shown on the right side of each plot indicates a dip in data entries.

The patterns identified within these plots are not typically included in a final product, as they do not show how variables correlate with one another or with the target. Rather, they are useful on the back end for seeing where a machine learning model may come up sparse or underfit (lacking data in its training and testing sets).

In [None]:
import missingno as msno

msno.matrix(data_karoiA);
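
As a numeric companion to the matrix plot, the same sparsity can be read directly from Pandas. The sketch below is only an illustration: the `sample` dataframe and its values are made up, standing in for a sheet such as `data_karoiA`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one survey sheet (values are invented).
sample = pd.DataFrame({
    "Gender (M/F)": ["F", "M", None, "F"],
    "Age (number)": [56.0, np.nan, 28.0, np.nan],
    "Received training (N, Y)": ["Y", "Y", "N", "Y"],
})

# Fraction of missing entries per column, most sparse column first.
null_fraction = sample.isna().mean().sort_values(ascending=False)
print(null_fraction)

# Number of rows in which every question was answered.
complete_rows = int(sample.notna().all(axis=1).sum())
print(complete_rows)
```

Columns near the top of `null_fraction` are the ones where more data may be required.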

In [None]:
#agricultural services, karoi_B
data_karoiB = pd.read_csv("/Users/aliably/Desktop/Karoi_B.csv")


In [None]:
data_karoiB

In [None]:
data_karoiB.shape

In [None]:
data_karoiB.count()
#shows number of replies in each column

In [None]:
data_karoiB.dtypes

In [None]:
data_karoiB.apply('nunique')

In [None]:
msno.matrix(data_karoiB);

In [None]:
#climate change perception (S, D, U, A, G), karoi_C
data_karoiC = pd.read_csv("/Users/aliably/Desktop/Karoi_C.csv")


In [None]:
data_karoiC

In [None]:
data_karoiC.shape

In [None]:
data_karoiC.count()
#shows number of replies in each column

In [None]:
data_karoiC.dtypes

In [None]:
data_karoiC.apply('nunique')

In [None]:
msno.matrix(data_karoiC);

In [None]:
#impact of climate change, karoi_D
data_karoiD = pd.read_csv("/Users/aliably/Desktop/Karoi_D.csv")


In [None]:
data_karoiD

In [None]:
data_karoiD.shape

In [None]:
data_karoiD.count()
#shows number of replies in each column

In [None]:
data_karoiD.dtypes

In [None]:
data_karoiD.apply('nunique')

In [None]:
msno.matrix(data_karoiD);

In [None]:
#Indigenous Knowledge, Perception & Adaptation, karoi_E

data_karoiE = pd.read_csv("/Users/aliably/Desktop/Karoi_E.csv")


In [None]:
data_karoiE

In [None]:
data_karoiE.shape

In [None]:
data_karoiE.count()
#shows number of replies in each column

In [None]:
data_karoiE.dtypes

In [None]:
data_karoiE.apply('nunique')

In [None]:
msno.matrix(data_karoiE);

# Fill NaNs

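NaN values are filled with 0s using .fillna(0), and each dataframe is reprinted. Note that .fillna(0) returns a new dataframe, so the cells below display the filled data without modifying the original variables.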

In [None]:
data_karoiA.fillna(0)

In [None]:
data_karoiB.fillna(0)

In [None]:
data_karoiC.fillna(0)

In [None]:
data_karoiD.fillna(0)

In [None]:
data_karoiE.fillna(0)
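
The following sketch, using invented ages rather than the survey data, shows two properties of `.fillna(0)` that matter when summary statistics are computed later: it returns a copy, and the filled zeros take part in a mean, whereas NaN values are skipped by default.

```python
import numpy as np
import pandas as pd

# Illustrative values only; not taken from the Karoi sheets.
ages = pd.DataFrame({"Age (number)": [56.0, 50.0, np.nan, 28.0]})

# fillna returns a new dataframe; `ages` itself still holds the NaN.
filled = ages.fillna(0)
still_missing = int(ages["Age (number)"].isna().sum())

# Pandas skips NaN when averaging, so the two means differ.
mean_skipping_nan = ages["Age (number)"].mean()   # (56 + 50 + 28) / 3
mean_after_fill = filled["Age (number)"].mean()   # (56 + 50 + 0 + 28) / 4

print(still_missing, mean_skipping_nan, mean_after_fill)
```

To keep the filled version, assign the result back to a variable, as in `filled = ages.fillna(0)`.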

# Seaborn Visualization

Seaborn is a Python visualization library that builds on Matplotlib (a plotting library) and integrates closely with Pandas data structures.

In [None]:
from matplotlib import rcParams

#matplotlib package that allows for custom parameter tuning in visuals

In [None]:
#set the figure size first so that it applies to the plot below
rcParams['figure.figsize'] = 14,10

sns.countplot(y = data_karoiA['2. Age (number)']).set(title = 'Karoi Age Values')

In [None]:
pd.cut(data_karoiA['2. Age (number)'], bins=4)

In [None]:
sns.countplot(x = pd.cut(data_karoiA['2. Age (number)'], bins=4))
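
To make the binning step concrete, here is a small sketch of what `pd.cut` produces. The ages are illustrative, and the explicit bin edges and labels in the second half are assumptions chosen for the example, not values used elsewhere in this notebook.

```python
import pandas as pd

# Illustrative ages only.
ages = pd.Series([56, 50, 55, 28, 34, 39, 34, 56, 52], name="Age (number)")

# bins=4 splits the observed range (28 to 56) into four equal-width intervals.
equal_width = pd.cut(ages, bins=4)
equal_width_counts = equal_width.value_counts(sort=False)
print(equal_width_counts)

# Explicit edges and labels (hypothetical) give bins that are easier to read.
labeled = pd.cut(ages, bins=[0, 30, 45, 60], labels=["30 or under", "31-45", "46-60"])
labeled_counts = labeled.value_counts(sort=False)
print(labeled_counts)
```

Passing `sort=False` keeps the bins in age order, which matches the order in which a count plot draws them.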


# Summary Statistics

In descriptive statistics, summary statistics are used to summarize a set of observations, in order to communicate the largest amount of information as simply as possible.

In [None]:
joined_CSV = pd.read_csv('/Users/aliably/Desktop/joined_CSV.csv')

#load the CSV in which all separate sheets have been rejoined

In [None]:
joined_CSV.fillna(0)

In [None]:
joined_CSV.apply('nunique')

In [None]:
joined_CSV['Age (number)'].mean()

In [None]:
joined_CSV['Agriculture experience (number)'].mean()

In [None]:
#count the occurrences of each value (C, T, S, O, or C/T) in the Farming practice column

joined_CSV['Farming practice (C, T, S, O)'].value_counts()

In [None]:
msno.matrix(joined_CSV);
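
The individual calls above can also be gathered with `describe()`, which reports count, mean, spread, and quartiles in one step. This is a sketch on a made-up dataframe named `survey`; the column names mirror the joined CSV, but the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the joined survey data.
survey = pd.DataFrame({
    "Age (number)": [56.0, 50.0, 55.0, 28.0, np.nan],
    "Farming practice (C, T, S, O)": ["T", "S", "T", "S", None],
})

# Count, mean, std, min, quartiles, and max for a numeric column.
numeric_summary = survey["Age (number)"].describe()
print(numeric_summary)

# dropna=False keeps unanswered entries visible in the tally.
practice_counts = survey["Farming practice (C, T, S, O)"].value_counts(dropna=False)
print(practice_counts)
```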

# Cross Examination of Variables

The plots below use Seaborn together with Matplotlib's Pyplot to compare how one variable is distributed across the groups of another, so each plot involves at least two variables.

In [None]:
import matplotlib.pyplot as plt

fg = sns.FacetGrid(joined_CSV, hue="Gender (M/F)", aspect=3)
fg.map(sns.kdeplot, "Age (number)", fill=True)
fg.set(xlim=(0, 80))
plt.legend();

In [None]:
fg = sns.FacetGrid(joined_CSV, hue="Gender (M/F)", aspect=3)
fg.map(sns.kdeplot, "Agriculture experience (number)", fill=True)
fg.set(xlim=(0, 80))
plt.legend();
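
The density plots above compare distributions visually; a grouped mean gives the matching numbers. The sketch below assumes a small invented dataframe, `survey`, with the same column names as the joined CSV.

```python
import numpy as np
import pandas as pd

# Illustrative rows only.
survey = pd.DataFrame({
    "Gender (M/F)": ["F", "F", "M", "M"],
    "Age (number)": [56.0, 50.0, 55.0, 28.0],
    "Agriculture experience (number)": [33.0, 38.0, np.nan, 16.0],
})

# Mean age and experience per gender; NaN entries are skipped.
by_gender = survey.groupby("Gender (M/F)")[
    ["Age (number)", "Agriculture experience (number)"]
].mean()
print(by_gender)
```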

In [None]:
prop_trained_k = (joined_CSV['Received training (N, Y)'] == 'Y').value_counts()

prop_trained_k

In [None]:
prop_women = (joined_CSV['Gender (M/F)'] == 'F') & (joined_CSV['Received training (N, Y)'] == 'Y')

prop_women

In [None]:
prop_men = (joined_CSV['Gender (M/F)'] == 'M') & (joined_CSV['Received training (N, Y)'] == 'Y')

prop_men

In [None]:
print ("The original list is : " +  str(prop_women))
  
# using enumerate() + list comprehension
# to return true indices.
res = [i for i, val in enumerate(prop_women) if val]
  
# printing result
print ("The list indices having True values are : " +  str(res))

In [None]:
rcParams['figure.figsize'] = 14,10

sns.countplot(y = prop_women).set(title = 'Proportion of Women (Karoi) Who Have Received Training')

In [None]:
print ("The original list is : " +  str(prop_men))
  
res = [i for i, val in enumerate(prop_men) if val]
  
print ("The list indices having True values are : " +  str(res))

In [None]:
rcParams['figure.figsize'] = 14,10

sns.countplot(y = prop_men).set(title = 'Proportion of Men (Karoi) Who Have Received Training')

So, 49 women and 51 men have received the necessary training, 100 participants in total. This leaves 31 participants who either have not received training or did not offer a response. Thus, 76% of the 131 sampled participants have received training.
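
Counts like these can be read directly from the boolean masks instead of listing the True indices. The following is a minimal sketch on invented rows; `survey` is a hypothetical stand-in for `joined_CSV`.

```python
import pandas as pd

# Illustrative responses only; None marks an unanswered question.
survey = pd.DataFrame({
    "Gender (M/F)": ["F", "F", "M", "M", "F", "M"],
    "Received training (N, Y)": ["Y", "N", "Y", "Y", None, "N"],
})

trained = survey["Received training (N, Y)"] == "Y"

# Summing a boolean mask counts its True values.
women_trained = int(((survey["Gender (M/F)"] == "F") & trained).sum())
men_trained = int(((survey["Gender (M/F)"] == "M") & trained).sum())
print(women_trained, men_trained)

# A cross-tabulation gives every gender/answer combination at once.
# Rows with a missing answer are left out of the table.
table = pd.crosstab(survey["Gender (M/F)"], survey["Received training (N, Y)"])
print(table)
```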

______________________________

# Dataset 2: Ndambe

In [6]:
data_ndambe = pd.read_csv("/Users/aliably/Desktop/Raw Data /Ndambe_CSV.csv")

data_ndambe

Unnamed: 0,Gender (M/F),Age (number),Agriculture experience (number),"Religion (C, M, A, X, O, N)","Education (N, P, S, A)","Farming practice (C, T, S, O) T is cattle and C crop","Contacted extention officers (N, Y)","Received training (N, Y)",Winter is now short,Hot and dry season is now long,...,Forest resources are more difficult to collect,Incidence of pests in crops is unchanged,Climate variability affected crop productivity,"What causes climate change? (G, R, D, C, O)","How did you know about climate change? (R, N, P, O, A)","How do you know when to plant crops? (R, E, T, P)","What TBI to predict when to plant crops? (F, B, I, P, O)","What TBI do you use to predict droughts? (F, B, I, P, O)","Who taught you the TBI? (F, P, G, V, O)",What are you doing to reduce effects of climate change?
0,F,56.0,33.0,C,P,T,Y,Y,G,G,...,A,D,G,R,O,T,P,P,G,"Boreholes, dry planting"
1,F,50.0,38.0,C,P,S,Y,Y,U,U,...,G,D,G,R,P,E,O,P,G,Dry planting
2,M,55.0,,O,P,T,Y,Y,G,G,...,G,D,G,R,O,T,P,P,G,Drought resistant crops
3,M,28.0,16.0,C,P,S,N,N,U,U,...,G,D,G,R,P,T,P,P,G,Buying food
4,F,34.0,10.0,C,A,O,Y,Y,S,G,...,U,D,A,O,O,R,P,P,V,Planting drought resistant crops.
5,F,39.0,20.0,,P,S,Y,Y,G,G,...,G,D,G,R,O,T,P,P,G,Construction of dams
6,F,34.0,15.0,C,P,S,Y,Y,G,G,...,S,A,S,R,O,T,P,P,V,drought resistant crops
7,F,56.0,35.0,O,N,O,Y,Y,G,G,...,G,D,G,R,O,T,P,P,G,drought resistance crops
8,F,52.0,35.0,O,S,T,Y,Y,G,G,...,G,D,G,R,O,T,P,P,G,construction of dams and boreholes
9,F,,,N,N,S,,,G,G,...,S,D,G,R,O,,P,P,G,DAMS


In [7]:
data_ndambe.shape

(58, 35)

In [8]:
data_ndambe.fillna(0)

Unnamed: 0,Gender (M/F),Age (number),Agriculture experience (number),"Religion (C, M, A, X, O, N)","Education (N, P, S, A)","Farming practice (C, T, S, O) T is cattle and C crop","Contacted extention officers (N, Y)","Received training (N, Y)",Winter is now short,Hot and dry season is now long,...,Forest resources are more difficult to collect,Incidence of pests in crops is unchanged,Climate variability affected crop productivity,"What causes climate change? (G, R, D, C, O)","How did you know about climate change? (R, N, P, O, A)","How do you know when to plant crops? (R, E, T, P)","What TBI to predict when to plant crops? (F, B, I, P, O)","What TBI do you use to predict droughts? (F, B, I, P, O)","Who taught you the TBI? (F, P, G, V, O)",What are you doing to reduce effects of climate change?
0,F,56.0,33.0,C,P,T,Y,Y,G,G,...,A,D,G,R,O,T,P,P,G,"Boreholes, dry planting"
1,F,50.0,38.0,C,P,S,Y,Y,U,U,...,G,D,G,R,P,E,O,P,G,Dry planting
2,M,55.0,0.0,O,P,T,Y,Y,G,G,...,G,D,G,R,O,T,P,P,G,Drought resistant crops
3,M,28.0,16.0,C,P,S,N,N,U,U,...,G,D,G,R,P,T,P,P,G,Buying food
4,F,34.0,10.0,C,A,O,Y,Y,S,G,...,U,D,A,O,O,R,P,P,V,Planting drought resistant crops.
5,F,39.0,20.0,0,P,S,Y,Y,G,G,...,G,D,G,R,O,T,P,P,G,Construction of dams
6,F,34.0,15.0,C,P,S,Y,Y,G,G,...,S,A,S,R,O,T,P,P,V,drought resistant crops
7,F,56.0,35.0,O,N,O,Y,Y,G,G,...,G,D,G,R,O,T,P,P,G,drought resistance crops
8,F,52.0,35.0,O,S,T,Y,Y,G,G,...,G,D,G,R,O,T,P,P,G,construction of dams and boreholes
9,F,0.0,0.0,N,N,S,0,0,G,G,...,S,D,G,R,O,0,P,P,G,DAMS


In [9]:
data_ndambe.count()
#shows number of replies in each column

Gender (M/F)                                                58
Age (number)                                                57
Agriculture experience (number)                             55
Religion (C, M, A, X, O, N)                                 47
Education (N, P, S, A)                                      57
Farming practice (C, T, S, O) T is cattle and C crop        46
Contacted extention officers (N, Y)                         55
Received training (N, Y)                                    55
Winter is now short                                         58
Hot and dry season is now long                              58
Atmospheric temperature increased                           58
Rain season is now short                                    57
Rainfall is now erratic                                     58
*Local weather has not changed                              58
Frequency of drought has increased                          57
*Flooding frequency unchanged                          

In [10]:
data_ndambe.dtypes

Gender (M/F)                                                 object
Age (number)                                                float64
Agriculture experience (number)                             float64
Religion (C, M, A, X, O, N)                                  object
Education (N, P, S, A)                                       object
Farming practice (C, T, S, O) T is cattle and C crop         object
Contacted extention officers (N, Y)                          object
Received training (N, Y)                                     object
Winter is now short                                          object
Hot and dry season is now long                               object
Atmospheric temperature increased                            object
Rain season is now short                                     object
Rainfall is now erratic                                      object
*Local weather has not changed                               object
Frequency of drought has increased              

In [11]:
data_ndambe.apply('nunique')

Gender (M/F)                                                 2
Age (number)                                                32
Agriculture experience (number)                             21
Religion (C, M, A, X, O, N)                                  4
Education (N, P, S, A)                                       4
Farming practice (C, T, S, O) T is cattle and C crop         4
Contacted extention officers (N, Y)                          2
Received training (N, Y)                                     2
Winter is now short                                          5
Hot and dry season is now long                               5
Atmospheric temperature increased                            3
Rain season is now short                                     2
Rainfall is now erratic                                      5
*Local weather has not changed                               5
Frequency of drought has increased                           4
*Flooding frequency unchanged                          

In [12]:
import missingno as msno

msno.matrix(data_ndambe);

Using the concat function, the age values from both datasets are joined side by side to allow for the cross-examination of age-related trends in the two geographical locations.

In [13]:
#joined_CSV comes from the Summary Statistics section; reload it if the kernel was restarted
joined_CSV = pd.read_csv('/Users/aliably/Desktop/joined_CSV.csv')

age_join = pd.concat([joined_CSV['Age (number)'], data_ndambe['Age (number)']], axis=1, 
                     keys=['Karoi Ages', 'Ndambe Ages'])

In [14]:
age_join

In [None]:
karoi_mean = age_join['Karoi Ages'].mean()

karoi_mean

In [None]:
ndambe_mean = age_join['Ndambe Ages'].mean()

ndambe_mean
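
Because the two datasets have different numbers of rows, joining them along `axis=1` pads the shorter column with NaN. The sketch below shows this on two invented series; the names `karoi_ages` and `ndambe_ages` are hypothetical.

```python
import pandas as pd

# Illustrative ages only; the series deliberately differ in length.
karoi_ages = pd.Series([30.0, 41.0, 38.0], name="Age (number)")
ndambe_ages = pd.Series([56.0, 50.0], name="Age (number)")

# axis=1 places the series side by side; keys become the column names.
age_join = pd.concat([karoi_ages, ndambe_ages], axis=1,
                     keys=["Karoi Ages", "Ndambe Ages"])
print(age_join)

# The padding NaN is skipped, so each mean uses only real entries.
ndambe_mean = age_join["Ndambe Ages"].mean()
print(ndambe_mean)
```

Rows are aligned by index, so the pairing of a Karoi row with a Ndambe row carries no meaning beyond position.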

In [None]:
rcParams['figure.figsize'] = 14,10

sns.countplot(y = data_ndambe['Age (number)']).set(title = 'Age Values Ndambe')

In [15]:
prop_women_n = (data_ndambe['Gender (M/F)'] == 'F') & (data_ndambe['Received training (N, Y)'] == 'Y')

prop_women_n

0      True
1      True
2     False
3     False
4      True
5      True
6      True
7      True
8      True
9     False
10    False
11     True
12    False
13    False
14    False
15    False
16    False
17     True
18     True
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28     True
29    False
30     True
31     True
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
41     True
42    False
43    False
44    False
45    False
46    False
47    False
48    False
49    False
50    False
51    False
52    False
53     True
54    False
55     True
56    False
57    False
dtype: bool

In [None]:
rcParams['figure.figsize'] = 14,10

sns.countplot(y = prop_women_n).set(title = 'Proportion of Women (Ndambe) Who Have Received Training')

In [None]:
prop_trained = (data_ndambe['Received training (N, Y)'] == 'Y').value_counts()

prop_trained

In [None]:
prop_men_n = (data_ndambe['Gender (M/F)'] == 'M') & (data_ndambe['Received training (N, Y)'] == 'Y')

prop_men_n

In [None]:
rcParams['figure.figsize'] = 14,10

sns.countplot(y = prop_men_n).set(title = 'Proportion of Men (Ndambe) Who Have Received Training')

In [None]:
print ("The original list is : " +  str(prop_men_n))
  
res = [i for i, val in enumerate(prop_men_n) if val]
  
print ("The list indices having True values are : " +  str(res))

In [None]:
print ("The original list is : " +  str(prop_women_n))
  
res = [i for i, val in enumerate(prop_women_n) if val]
  
print ("The list indices having True values are : " +  str(res))

So, 16 women and 5 men have received training, 21 participants in total. That is 38% of the 55 participants who answered the training question, or 36% of all 58 sampled participants.
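
A percentage like this depends on whether unanswered entries are counted in the denominator. The sketch below shows both versions on an invented series named `answers`.

```python
import pandas as pd

# Illustrative answers only; None marks a participant who gave no response.
answers = pd.Series(["Y", "N", "Y", None, "N"], name="Received training (N, Y)")

trained = int((answers == "Y").sum())

# len() counts every sampled participant, including non-responses.
share_of_all = trained / len(answers)

# count() counts only the non-null answers.
share_of_respondents = trained / answers.count()

print(trained, share_of_all, share_of_respondents)
```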

In [None]:
fg = sns.FacetGrid(data_ndambe, hue="Gender (M/F)", aspect=3)
fg.map(sns.kdeplot, "Age (number)", fill=True)
fg.set(xlim=(0, 80))
plt.legend();

In [None]:
fg = sns.FacetGrid(data_ndambe, hue="Gender (M/F)", aspect=3)
fg.map(sns.kdeplot, "Agriculture experience (number)", fill=True)
fg.set(xlim=(0, 80))
plt.legend();

## Recap: What Have We Learned So Far?

1. 76% of Karoi participants have received necessary training versus 38% of Ndambe participants. 
2. The mean age in Karoi is lower (36.5) than that of Ndambe (52.9). 
3. In Ndambe, more women received training than men, but in Karoi, more men were trained, though by a slim margin. 
4. In Karoi, age for women and men is relatively uniform, though men have slightly more agricultural experience. In Ndambe, the mean age of women is lower, and, comparatively, they have more agricultural experience. 
5. Data regarding farming practices and what individuals are doing to reduce the impacts of climate change are the most sparse. 

# Working With Categorical Variables

One-hot encoding is the process of replacing a categorical variable with a set of binary (True/False or 1/0) columns, one per category. Because many machine learning models are algebraic, they rely on numerical data. Making a numerical copy of the categorical data early on therefore keeps the option of future modeling open. This process does not disturb the original data, so all categorical features remain intact.
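
One way to produce such a numerical copy is `pd.get_dummies`. This is a sketch under assumptions rather than the encoding this notebook settles on: the `survey` dataframe is invented, and the `Religion` prefix is chosen only to keep the new column names short.

```python
import pandas as pd

# Illustrative rows only.
survey = pd.DataFrame({
    "Gender (M/F)": ["F", "M", "F"],
    "Religion (C, M, A, X, O, N)": ["C", "O", "N"],
})

# One 0/1 column is created per category that appears in the data.
encoded = pd.get_dummies(
    survey,
    columns=["Religion (C, M, A, X, O, N)"],
    prefix="Religion",
    dtype=int,
)
print(encoded)

# get_dummies returns a new dataframe, so the categorical column is
# still present in `survey`.
print(survey.columns.tolist())
```

Only categories present in the data get a column, so a code that no participant selected will not appear in `encoded`.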

In [None]:
joined_CSV = joined_CSV.fillna(0)
joined_CSV

In [None]:
joined_CSV['Religion (C, M, A, X, O, N)'].value_counts()

In [None]:
rcParams['figure.figsize'] = 14,10

sns.countplot(y = joined_CSV['Religion (C, M, A, X, O, N)']).set(title = 'Religious Values in Karoi')

In [None]:
data_ndambe = data_ndambe.fillna(0)
data_ndambe

In [None]:
data_ndambe['Religion (C, M, A, X, O, N)'].value_counts()

In [None]:
rcParams['figure.figsize'] = 14,10

sns.countplot(y = data_ndambe['Religion (C, M, A, X, O, N)']).set(title = 'Religious Values in Ndambe')