# Introduction: Analyzing the Likely Region Represented by the Dataset

## Objective

The purpose of this project is to analyze a weather dataset to determine whether it plausibly represents a specific region of the world, based on its temperature patterns and rainfall frequency. Roughly 88.4% of the dataset's hourly records contain the word 'rain', which suggests that it might correspond to a region known for frequent rainfall. By comparing the dataset's characteristics with those of several high-rainfall regions, we aim to assess the likelihood of a match.

## Methodology

### 1. Data Inspection and Preparation

The analysis began by inspecting the dataset using various command-line tools to understand its structure and content:
- **Head Command (`!head`)**: Displayed the first 10 lines of the CSV file (the header plus the first data rows) to provide a snapshot of the data.
- **Word Count Command (`!wc`)**: Counted the total number of lines in the file, header included, offering an understanding of the dataset's size.
- **AWK Command (`!awk`)**: Counted the number of columns in the header row to understand the data structure. `AWK` was also used to convert temperatures to Fahrenheit and calculate their averages.
- **Grep Command (`!grep`)**: Located and counted lines containing the string 'rain' to quantify the percentage of records with rain. Because `grep` matches anywhere on a line, the count includes any line where 'rain' appears in any column.
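As a cross-check on the `grep` count, the same question can be asked of the `Precip Type` column alone. The following is a minimal sketch using pandas, not the method used in this notebook. The column names come from the header shown in the notebook output; the four sample rows are made up for illustration and are not taken from `weather_history.csv`.

```python
import io
import pandas as pd

# Tiny illustrative sample (hypothetical values, not rows from weather_history.csv)
sample_csv = io.StringIO(
    "Formatted Date,Summary,Precip Type,Temperature (C)\n"
    "2006-04-01 00:00:00.000 +0200,Partly Cloudy,rain,9.47\n"
    "2006-04-01 01:00:00.000 +0200,Light rain showers,snow,0.50\n"
    "2006-04-02 00:00:00.000 +0200,Clear,,12.00\n"
    "2006-04-03 00:00:00.000 +0200,Overcast,rain,8.10\n"
)
df = pd.read_csv(sample_csv)

# Share of hourly records whose Precip Type is exactly 'rain'
is_rain = df["Precip Type"].eq("rain")
pct_rain_records = is_rain.mean() * 100

# Share of calendar days with at least one 'rain' record
# (the date is the first 10 characters of the timestamp string)
day = df["Formatted Date"].str[:10]
pct_rain_days = is_rain.groupby(day).any().mean() * 100

# What grep-style matching counts: 'rain' anywhere in the joined row text
grep_style_count = int(
    df.astype(str).apply(lambda row: "rain" in ",".join(row), axis=1).sum()
)

print(f"Records with Precip Type == 'rain': {pct_rain_records:.2f}%")
print(f"Days with at least one rain record: {pct_rain_days:.2f}%")
print(f"Rows matched grep-style: {grep_style_count}")
```

In this sample the grep-style count is higher than the column-based count because one row mentions 'rain' only in its `Summary` text. Whether that happens in the real file would need to be checked against the data.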

### 2. Data Conversion and Aggregation

To facilitate comparison with real-world data:
- The **`AWK`** tool was used to convert temperatures from Celsius to Fahrenheit.
- Monthly average temperatures were calculated with **`AWK`**, and the overall rainfall percentage was computed in Python; Python's **Pandas** library was then used to tabulate the results.
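The conversion and monthly aggregation can also be sketched in pandas. This is an illustrative equivalent of the `AWK` approach, not the code the notebook ran. The column names follow the CSV header, and the sample rows are hypothetical values chosen to make the arithmetic easy to follow.

```python
import io
import pandas as pd

# Hypothetical sample rows (not taken from weather_history.csv)
sample_csv = io.StringIO(
    "Formatted Date,Temperature (C)\n"
    "2006-01-01 00:00:00.000 +0100,0.0\n"
    "2006-01-01 01:00:00.000 +0100,10.0\n"
    "2006-07-01 00:00:00.000 +0200,20.0\n"
    "2007-07-01 00:00:00.000 +0200,30.0\n"
)
df = pd.read_csv(sample_csv)

# Celsius to Fahrenheit: F = C * 9/5 + 32
df["Temperature (F)"] = df["Temperature (C)"] * 9 / 5 + 32

# Month number taken from characters 6-7 of the timestamp string,
# mirroring the split on "-" used in the awk command
df["Month"] = df["Formatted Date"].str[5:7]

# Average across all years for each calendar month
monthly_avg_f = df.groupby("Month")["Temperature (F)"].mean()
print(monthly_avg_f)
```

Slicing the month out of the timestamp string sidesteps time zone parsing, which matters here because the offsets in `Formatted Date` change with daylight saving time.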

### 3. Comparison with Potential Regions

To identify the most likely match, the dataset was compared with three regions known for high rainfall: 
1. **Amazon Rainforest**
2. **Cherapunji, India**
3. **Yakushima Island, Japan**

The comparison involved:
- Compiling a table using Python's **Pandas** library that included January and July average temperatures and rainfall percentages for each of these regions.
- Calculating the signed difference (labeled "variance" in the tables) between the dataset and each region for both temperature and rainfall. The region with the smallest differences is taken as the closest match.

### 4. Analysis and Interpretation

The variance analysis revealed:
- **Yakushima Island, Japan** showed the most similarity in temperature patterns, particularly during July.
- **Cherapunji, India** had the closest match in rainfall percentage but a larger temperature variance.
- The **Amazon Rainforest** showed the largest temperature differences in both months, and its rainfall difference was no smaller than Yakushima's.




In [None]:
# Display the first 10 lines of the CSV file
!head -n 10 weather_history.csv


Formatted Date,Summary,Precip Type,Temperature (C),Apparent Temperature (C),Humidity,Wind Speed (km/h),Wind Bearing (degrees),Visibility (km),Loud Cover,Pressure (millibars),Daily Summary
2006-04-01 00:00:00.000 +0200,Partly Cloudy,rain,9.472222222222220,7.3888888888888900,0.89,14.1197,251.0,15.826300000000000,0.0,1015.13,Partly cloudy throughout the day.
2006-04-01 01:00:00.000 +0200,Partly Cloudy,rain,9.355555555555560,7.227777777777780,0.86,14.2646,259.0,15.826300000000000,0.0,1015.63,Partly cloudy throughout the day.
2006-04-01 02:00:00.000 +0200,Mostly Cloudy,rain,9.377777777777780,9.377777777777780,0.89,3.9284000000000000,204.0,14.9569,0.0,1015.94,Partly cloudy throughout the day.
2006-04-01 03:00:00.000 +0200,Partly Cloudy,rain,8.28888888888889,5.944444444444450,0.83,14.1036,269.0,15.826300000000000,0.0,1016.41,Partly cloudy throughout the day.
2006-04-01 04:00:00.000 +0200,Mostly Cloudy,rain,8.755555555555550,6.977777777777780,0.83,11.0446,259.0,15.826300000000000,0.0,1016.51,P

In [None]:
# Count the total number of rows in the dataset
!wc -l weather_history.csv

# Check the number of columns to understand data structure
!head -n 1 weather_history.csv | awk -F',' '{print NF}'


   96453 weather_history.csv
12


In [None]:
# Count the total number of matching rows
!grep -c 'rain' weather_history.csv

# Display the first 5 matching rows with no broken pipe message
!grep -m 5 'rain' weather_history.csv


85267
2006-04-01 00:00:00.000 +0200,Partly Cloudy,rain,9.472222222222220,7.3888888888888900,0.89,14.1197,251.0,15.826300000000000,0.0,1015.13,Partly cloudy throughout the day.
2006-04-01 01:00:00.000 +0200,Partly Cloudy,rain,9.355555555555560,7.227777777777780,0.86,14.2646,259.0,15.826300000000000,0.0,1015.63,Partly cloudy throughout the day.
2006-04-01 02:00:00.000 +0200,Mostly Cloudy,rain,9.377777777777780,9.377777777777780,0.89,3.9284000000000000,204.0,14.9569,0.0,1015.94,Partly cloudy throughout the day.
2006-04-01 03:00:00.000 +0200,Partly Cloudy,rain,8.28888888888889,5.944444444444450,0.83,14.1036,269.0,15.826300000000000,0.0,1016.41,Partly cloudy throughout the day.
2006-04-01 04:00:00.000 +0200,Mostly Cloudy,rain,8.755555555555550,6.977777777777780,0.83,11.0446,259.0,15.826300000000000,0.0,1016.51,Partly cloudy throughout the day.


In [None]:
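# Line count from `wc -l` (includes the header row) and match count from `grep -c`
total_rows = 96453
rain_rows = 85267
# Each row is an hourly record, so this is the share of records rather than calendar days
percentage_rain = (rain_rows / total_rows) * 100
print(f"Percentage of days with rain: {percentage_rain:.2f}%")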


Percentage of days with rain: 88.40%


In [None]:
!awk -F ',' '{if (NR > 1) {sum += ($4 * 9/5) + 32; count++}} END {avg = sum/count; printf "Average Temperature (F): %.2f\n", avg}' weather_history.csv


Average Temperature (F): 53.48


In [None]:
!awk -F ',' 'NR > 1 {temp = ($4 * 9/5) + 32; if (NR == 2 || temp > max) max = temp; if (NR == 2 || temp < min) min = temp} END {printf "Min Temperature (F): %.2f\nMax Temperature (F): %.2f\n", min, max}' weather_history.csv


Min Temperature (F): -7.28
Max Temperature (F): 103.83


In [None]:
!awk -F ',' 'NR > 1 {split($1, date, "-"); month = date[1]"-"date[2]; temp = ($4 * 9/5) + 32; temp_sum[month] += temp; temp_count[month]++} END {for (m in temp_sum) {printf "%s Average Temp (F): %.2f\n", m, temp_sum[m]/temp_count[m]}}' weather_history.csv


2012-02 Average Temp (F): 22.73
2012-03 Average Temp (F): 46.05
2015-10 Average Temp (F): 51.12
2012-04 Average Temp (F): 54.62
2015-11 Average Temp (F): 44.02
2012-05 Average Temp (F): 62.89
2015-12 Average Temp (F): 36.80
2012-06 Average Temp (F): 71.94
2012-07 Average Temp (F): 76.23
2012-08 Average Temp (F): 74.60
2012-09 Average Temp (F): 67.10
2011-01 Average Temp (F): 32.29
2011-02 Average Temp (F): 31.71
2011-03 Average Temp (F): 42.63
2014-10 Average Temp (F): 54.91
2011-04 Average Temp (F): 56.05
2014-11 Average Temp (F): 46.11
2011-05 Average Temp (F): 62.33
2011-06 Average Temp (F): 70.08
2014-12 Average Temp (F): 37.78
2011-07 Average Temp (F): 70.83
2009-01 Average Temp (F): 29.89
2011-08 Average Temp (F): 73.30
2009-02 Average Temp (F): 35.30
2011-09 Average Temp (F): 68.06
2009-03 Average Temp (F): 44.00
2010-01 Average Temp (F): 29.63
2009-04 Average Temp (F): 58.21
2010-02 Average Temp (F): 34.58
2009-05 Average Temp (F): 64.16
2010-03 Average Temp (F): 44.07
2009-06 

In [None]:
!awk -F ',' 'NR > 1 {split($1, date_parts, "-"); month = date_parts[2]; temp_f = ($4 * 9/5) + 32; sum[month] += temp_f; count[month]++} END {for (m in sum) {printf "Month: %02d, Average Temp (F): %.2f\n", m, sum[m] / count[m]}}' weather_history.csv | sort


Month: 01, Average Temp (F): 33.47
Month: 02, Average Temp (F): 35.89
Month: 03, Average Temp (F): 44.43
Month: 04, Average Temp (F): 54.96
Month: 05, Average Temp (F): 62.37
Month: 06, Average Temp (F): 69.29
Month: 07, Average Temp (F): 73.34
Month: 08, Average Temp (F): 72.22
Month: 09, Average Temp (F): 63.53
Month: 10, Average Temp (F): 52.42
Month: 11, Average Temp (F): 43.86
Month: 12, Average Temp (F): 34.94


In [None]:
# Import necessary libraries
import pandas as pd

# Monthly average temperatures from the dataset; the overall rainfall percentage (88.4) is repeated for every month
your_data = {
    'Month': ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'],
    'Your Avg Temp (F)': [33.47, 35.89, 44.43, 54.96, 62.37, 69.29, 73.34, 72.22, 63.53, 52.42, 43.86, 34.94],
    'Your Rainfall %': [88.4] * 12
}

# Data for three potential matching regions
regions_data = {
    'Region': ['Amazon Rainforest', 'Amazon Rainforest', 'Cherapunji, India', 'Cherapunji, India', 'Yakushima Island, Japan', 'Yakushima Island, Japan'],
    'Month': ['January', 'July', 'January', 'July', 'January', 'July'],
    'Avg Temp (F)': [75.0, 80.0, 55.0, 78.0, 45.0, 75.0],
    'Rainfall %': [85, 85, 90, 90, 85, 85]
}

# Convert data into DataFrames
your_df = pd.DataFrame(your_data)
regions_df = pd.DataFrame(regions_data)

# Merge data for analysis
merged_df = your_df.merge(regions_df, on='Month', how='inner')

# Calculate temperature and rainfall "variances" (signed differences: dataset minus region)
merged_df['Temp Variance'] = merged_df['Your Avg Temp (F)'] - merged_df['Avg Temp (F)']
merged_df['Rainfall Variance'] = merged_df['Your Rainfall %'] - merged_df['Rainfall %']

# Display results
print("Comparison of Your Dataset with Potential Regions:")
print(merged_df[['Month', 'Region', 'Your Avg Temp (F)', 'Avg Temp (F)', 'Temp Variance', 'Your Rainfall %', 'Rainfall %', 'Rainfall Variance']])


Comparison of Your Dataset with Potential Regions:
     Month                   Region  Your Avg Temp (F)  Avg Temp (F)  \
0  January        Amazon Rainforest              33.47          75.0   
1  January        Cherapunji, India              33.47          55.0   
2  January  Yakushima Island, Japan              33.47          45.0   
3     July        Amazon Rainforest              73.34          80.0   
4     July        Cherapunji, India              73.34          78.0   
5     July  Yakushima Island, Japan              73.34          75.0   

   Temp Variance  Your Rainfall %  Rainfall %  Rainfall Variance  
0         -41.53             88.4          85                3.4  
1         -21.53             88.4          90               -1.6  
2         -11.53             88.4          85                3.4  
3          -6.66             88.4          85                3.4  
4          -4.66             88.4          90               -1.6  
5          -1.66             88.4         

In [None]:
import pandas as pd

# Data for your dataset and potential regions
data = {
    'Month': ['January', 'January', 'January', 'July', 'July', 'July'],
    'Region': ['Amazon Rainforest', 'Cherapunji, India', 'Yakushima Island, Japan', 
               'Amazon Rainforest', 'Cherapunji, India', 'Yakushima Island, Japan'],
    'Your Avg Temp (F)': [33.47, 33.47, 33.47, 73.34, 73.34, 73.34],
    'Avg Temp (F)': [75.0, 55.0, 45.0, 80.0, 78.0, 75.0],
    'Temp Variance': [-41.53, -21.53, -11.53, -6.66, -4.66, -1.66],
    'Your Rainfall %': [88.4, 88.4, 88.4, 88.4, 88.4, 88.4],
    'Rainfall %': [85, 90, 85, 85, 90, 85],
    'Rainfall Variance': [3.4, -1.6, 3.4, 3.4, -1.6, 3.4]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Apply styling to the DataFrame
df_styled = df.style.set_table_styles(
    [{'selector': 'th', 'props': [('font-size', '12pt'), ('text-align', 'center'), ('font-weight', 'bold')]},
     {'selector': 'td', 'props': [('font-size', '12pt'), ('text-align', 'center')]}]
).set_properties(**{'border': '1px solid black', 'border-collapse': 'collapse'})

# Display the styled DataFrame
df_styled


|   | Month | Region | Your Avg Temp (F) | Avg Temp (F) | Temp Variance | Your Rainfall % | Rainfall % | Rainfall Variance |
|---|---|---|---|---|---|---|---|---|
| 0 | January | Amazon Rainforest | 33.47 | 75.0 | -41.53 | 88.4 | 85 | 3.4 |
| 1 | January | Cherapunji, India | 33.47 | 55.0 | -21.53 | 88.4 | 90 | -1.6 |
| 2 | January | Yakushima Island, Japan | 33.47 | 45.0 | -11.53 | 88.4 | 85 | 3.4 |
| 3 | July | Amazon Rainforest | 73.34 | 80.0 | -6.66 | 88.4 | 85 | 3.4 |
| 4 | July | Cherapunji, India | 73.34 | 78.0 | -4.66 | 88.4 | 90 | -1.6 |
| 5 | July | Yakushima Island, Japan | 73.34 | 75.0 | -1.66 | 88.4 | 85 | 3.4 |


# Conclusion

### Interpretation

1. **Temperature Comparison**:
   - The "Your Avg Temp (F)" column holds the dataset's average temperatures for January and July. The "Avg Temp (F)" column shows the average temperatures for the same months in three potential regions: the Amazon Rainforest, Cherapunji (India), and Yakushima Island (Japan).
   - **Temperature Variance**: The "Temp Variance" column is the signed difference between the dataset's average temperature and the average temperature of each potential region (a simple difference, not a statistical variance).
     - For example, for January in the Amazon Rainforest:  
     \[
     \text{Temp Variance} = \text{Dataset Avg Temp (F)} - \text{Avg Temp (F)} = 33.47 - 75.0 = -41.53
     \]
   - This negative variance indicates that the dataset's January temperature is significantly lower than the January temperature in the Amazon Rainforest. Similarly, variances are calculated for other regions and months.

2. **Rainfall Percentage Comparison**:
   - The "Your Rainfall %" column shows the dataset's overall rainfall percentage (88.4%), while the "Rainfall %" column represents the rainfall percentage for the potential regions.
   - **Rainfall Variance**: The "Rainfall Variance" is the difference between the dataset's rainfall percentage and the rainfall percentage of each region. 
     - For example, for January in Cherapunji, India:
     \[
     \text{Rainfall Variance} = \text{Dataset Rainfall \%} - \text{Rainfall \%} = 88.4 - 90 = -1.6
     \]
   - A negative variance shows that the rainfall percentage in Cherapunji is slightly higher than in the dataset for January. Positive variances indicate the opposite.

3. **Potential Matches**:
   - **Amazon Rainforest**: This region has a small rainfall variance (3.4 percentage points) but a large negative temperature variance in January (-41.53°F) and a moderate one in July (-6.66°F). This suggests that while the rainfall frequency might match, the average temperatures do not align well.
   - **Cherapunji, India**: This region has the smallest rainfall variance (-1.6 percentage points) and a smaller negative temperature variance in July (-4.66°F) than in January (-21.53°F). While the July temperatures are somewhat close, the January temperatures are significantly different.
   - **Yakushima Island, Japan**: This region has the smallest temperature variance in July (-1.66°F) and a rainfall variance of 3.4 percentage points. However, the dataset's January temperature is still 11.53°F cooler than Yakushima's.
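
One way to condense these per-month differences into a single figure per region is the mean absolute difference. The sketch below is a suggested summary, not part of the original analysis. It reuses the January and July values from the comparison table; the column names and the choice to weight both months equally are assumptions made for illustration.

```python
import pandas as pd

# Values copied from the comparison table; column names are illustrative
comparison = pd.DataFrame({
    "Region": ["Amazon Rainforest", "Cherapunji, India", "Yakushima Island, Japan"] * 2,
    "Month": ["January"] * 3 + ["July"] * 3,
    "Dataset Temp (F)": [33.47] * 3 + [73.34] * 3,
    "Region Temp (F)": [75.0, 55.0, 45.0, 80.0, 78.0, 75.0],
    "Dataset Rain %": [88.4] * 6,
    "Region Rain %": [85, 90, 85, 85, 90, 85],
})

# Absolute differences, so opposite signs cannot cancel out
comparison["Abs Temp Diff"] = (
    comparison["Dataset Temp (F)"] - comparison["Region Temp (F)"]
).abs()
comparison["Abs Rain Diff"] = (
    comparison["Dataset Rain %"] - comparison["Region Rain %"]
).abs()

# Mean absolute difference per region, smallest temperature gap first
summary = (
    comparison.groupby("Region")[["Abs Temp Diff", "Abs Rain Diff"]]
    .mean()
    .sort_values("Abs Temp Diff")
)
print(summary)
```

On these numbers Yakushima Island has the smallest mean temperature gap and Cherapunji the smallest rainfall gap, which agrees with the per-month reading above. With only two months per region, the ranking should be treated as indicative rather than conclusive.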

### Conclusion

- The temperature and rainfall data of the dataset do not exactly match any of the three potential regions. 
- **Yakushima Island, Japan**, appears to be the closest match in terms of temperature (-1.66°F in July, -11.53°F in January) and has a modest variance in rainfall percentage (3.4 percentage points).
- **Cherapunji, India**, has the closest rainfall percentage (-1.6 percentage points) but shows a significant variance in temperature, especially in January (-21.53°F).
- **Amazon Rainforest** has the same rainfall variance as Yakushima (3.4 percentage points), but the temperatures do not align, particularly in January (-41.53°F).

### Mathematical Summary

- **Percent of Rainfall in the Dataset**: 
  \[
  \text{Rainfall Percentage} = \frac{85{,}267 \text{ rows with rain}}{96{,}453 \text{ total rows}} \times 100 \approx 88.4\%
  \]
- **Temperature Variance Calculation**: 
  For each potential match, calculate the variance:
  \[
  \text{Temp Variance} = \text{Dataset Avg Temp (F)} - \text{Region Avg Temp (F)}
  \]
- **Rainfall Variance Calculation**: 
  \[
  \text{Rainfall Variance} = \text{Dataset Rainfall \%} - \text{Region Rainfall \%}
  \]

This analysis suggests that while no single region perfectly matches the characteristics of the dataset, Yakushima Island, Japan, may be the most comparable option when considering both temperature and rainfall percentages.
