# **Task 8: Analyze weather data**
Use a dataset of weather data and build a model that can predict future weather patterns.



# Importing Required Libraries
- The code starts by importing essential libraries such as NumPy, Pandas, Matplotlib, Seaborn, and Datetime. These libraries are commonly used in data analysis and visualization tasks.

# Reading Data
- The code loads two CSV files, 'city_attributes (1).csv' and 'humidity.csv', into Pandas DataFrames named 'data' and 'data2', respectively. The first lists 36 cities with their country, latitude, and longitude; the second holds hourly humidity readings starting on 2012-10-01, with one column per city.
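
As a minimal sketch of the loading step, the snippet below reads a small in-memory excerpt instead of the real file. The three-row `csv_text` and the name `sample` are made up for illustration; only `pd.read_csv` and its `parse_dates` parameter are standard pandas. Parsing the timestamp column while reading is one possible alternative to converting it afterwards.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt standing in for humidity.csv
csv_text = """datetime,Vancouver,Portland
2012-10-01 12:00:00,,
2012-10-01 13:00:00,76.0,81.0
2012-10-01 14:00:00,76.0,80.0
"""

# parse_dates converts the named column to datetime while the file is read
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["datetime"])

print(sample.dtypes)
print(sample.shape)
```

With the real file, the `io.StringIO(...)` argument would be replaced by the path to the CSV.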

# EDA Process
- The notebook then outlines the EDA (Exploratory Data Analysis) process it applies to each dataset: clean the data, process and analyze it, and visualize it. EDA is a critical phase in understanding and preparing data for predictive modeling, such as weather forecasting.

# Data Cleaning for 'data'
- Data cleaning for the 'data' DataFrame includes the following steps:
  - Checking data types using 'data.dtypes' to identify the types of data in the columns. Understanding data types is essential for data processing.
  - Obtaining the shape of the DataFrame using 'data.shape,' which provides information about the number of rows and columns.
  - Checking for duplicated rows using 'data.duplicated()'. Identifying and handling duplicates is important in data cleaning.
  - Replacing 'Israel' with 'Palestine' in the 'Country' column. This is a labeling choice made by the author; it changes only the country name and does not affect the humidity readings.
  - Removing the 'Latitude' and 'Longitude' columns from the DataFrame using 'data.drop(columns=['Latitude', 'Longitude'])'. This is a common practice in data cleaning when columns are not needed for analysis.
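
The cleaning steps listed above can be sketched on a tiny stand-in table. The `cities` frame, its deliberately duplicated row, and the 'United States' to 'USA' relabeling are hypothetical and chosen only to make each step visible. The sketch counts duplicates with `.sum()` rather than printing the full boolean Series, and assigns the result of `replace` back to the column instead of using `inplace=True` on a column selection.

```python
import pandas as pd

# Hypothetical rows mirroring the layout of the city attributes file
cities = pd.DataFrame({
    "City": ["Vancouver", "Portland", "Portland"],
    "Country": ["Canada", "United States", "United States"],
    "Latitude": [49.24966, 45.523449, 45.523449],
    "Longitude": [-123.119339, -122.676208, -122.676208],
})

print(cities.dtypes)   # column types
print(cities.shape)    # (rows, columns)

# A single count is easier to read than a long True/False listing
n_duplicates = int(cities.duplicated().sum())

# Drop the repeated row and the columns that are not needed
cities = cities.drop_duplicates().drop(columns=["Latitude", "Longitude"])

# Relabel a value by assigning the result back to the column
cities["Country"] = cities["Country"].replace("United States", "USA")

print(n_duplicates)
print(cities)
```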

# Data Cleaning for 'data2'
- Data cleaning for the 'data2' DataFrame includes various steps to prepare the data for analysis:
  - Checking the shape of the DataFrame using 'data2.shape' to see how many data points and features are present.
  - Using 'data2.info()' to get an overview of the DataFrame, including data types and non-null counts. This is essential for understanding the data's quality.
  - Displaying the first 10 rows of the DataFrame with 'data2.head(10)' to inspect sample data.
  - Checking for duplicated rows with 'data2.duplicated()'. Duplicates can affect the accuracy of predictive models.
  - Checking for null values in the DataFrame using 'data2.isnull().sum()'. Handling missing data is crucial for accurate predictions.
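  - Removing rows with null values using 'data2.dropna()'. Because a row is dropped when any city is missing a reading, this reduces the DataFrame from 26,239 to 18,898 rows and leaves gaps in the hourly sequence.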
  - Changing the data type of the 'datetime' column from 'object' to 'datetime' with 'data2['datetime'] = pd.to_datetime(data2['datetime'])'. This transformation is important for time-series analysis, which is often relevant in weather prediction.
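
To illustrate the trade-off in the missing-value step, the sketch below compares dropping incomplete rows with filling gaps by time-based interpolation. Interpolation is not what the notebook does; it is shown here only as one common alternative, and the six-hour `hum` frame is invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings with one gap per city
idx = pd.date_range("2012-10-01 13:00", periods=6, freq=pd.Timedelta(hours=1))
hum = pd.DataFrame(
    {
        "Vancouver": [76.0, np.nan, 78.0, 79.0, 80.0, 81.0],
        "Portland": [81.0, 80.0, 80.0, np.nan, 78.0, 77.0],
    },
    index=idx,
)

missing_per_city = hum.isnull().sum()

# Option 1 (used in the notebook): drop every hour with at least one gap
dropped = hum.dropna()

# Option 2 (alternative): fill each gap linearly in time, keeping all hours
filled = hum.interpolate(method="time")

print(missing_per_city)
print(len(hum), len(dropped), len(filled))
```

Dropping keeps only fully observed hours, while interpolation preserves the regular hourly spacing that many time-series methods expect. Which is preferable depends on how long the real gaps are.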

# Data Visualization
- The code doesn't include specific data visualizations. However, visualization is a key component of EDA, and additional code may be needed to create graphs or plots to better understand the data.
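
A simple starting point for the missing visualization step could be a line plot of humidity over time for a couple of cities. This is only a sketch: the values are the first five complete hourly readings for Vancouver and Las Vegas shown in the notebook output, while the frame name `hum`, the figure size, and the axis labels are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# First five complete hourly readings for two cities
hours = pd.date_range("2012-10-01 13:00", periods=5, freq=pd.Timedelta(hours=1))
hum = pd.DataFrame(
    {
        "Vancouver": [76.0, 76.0, 76.0, 77.0, 78.0],
        "Las Vegas": [22.0, 21.0, 21.0, 21.0, 21.0],
    },
    index=hours,
)

fig, ax = plt.subplots(figsize=(8, 3))
for city in hum.columns:
    ax.plot(hum.index, hum[city], label=city)  # one line per city

ax.set_xlabel("datetime")
ax.set_ylabel("humidity")
ax.legend()
fig.tight_layout()
```

In a notebook, the `matplotlib.use("Agg")` line can be dropped so the figure renders inline.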

# Extracting the Largest Value
- The code uses 'data2.nlargest(1, 'Vancouver')' to extract the row with the largest value in the 'Vancouver' column of the 'data2' DataFrame. The result is the full row, including the timestamp and the readings for every other city at that hour, so it shows when Vancouver's humidity peaked rather than just the peak value.
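
The difference between retrieving the whole row and retrieving only the peak value can be sketched as follows. The three-row `hum` frame is hypothetical; `nlargest`, `max`, and `idxmax` are standard pandas methods.

```python
import pandas as pd

# Hypothetical readings for a single city
hum = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            ["2012-10-01 13:00", "2012-10-01 14:00", "2012-10-01 15:00"]
        ),
        "Vancouver": [76.0, 100.0, 87.0],
    }
)

top_row = hum.nlargest(1, "Vancouver")   # one-row DataFrame with every column
peak_value = hum["Vancouver"].max()      # just the number
peak_time = hum.loc[hum["Vancouver"].idxmax(), "datetime"]  # when the peak occurred

print(top_row)
print(peak_value, peak_time)
```

When several rows share the maximum, `nlargest` with its default `keep='first'` returns the earliest one.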

In the context of weather data prediction, the code conducts data preprocessing and initial data quality checks. Further steps, such as feature engineering, model selection, and training, would likely follow this EDA process to build a predictive model for weather forecasting.
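
As a rough sketch of what a first modeling step might look like, the snippet below builds a one-hour lag feature and compares a persistence baseline (next hour equals the previous hour) with a one-feature least-squares fit. This is not the notebook's method, since the notebook stops before modeling. The series holds the first nine complete Vancouver readings from the output, and the 75% chronological split and the names `frame`, `persistence_mae`, and `linear_mae` are assumptions.

```python
import numpy as np
import pandas as pd

# First nine complete hourly Vancouver readings
hours = pd.date_range("2012-10-01 13:00", periods=9, freq=pd.Timedelta(hours=1))
series = pd.Series(
    [76.0, 76.0, 76.0, 77.0, 78.0, 78.0, 79.0, 79.0, 80.0],
    index=hours,
    name="Vancouver",
)

# Feature engineering: the previous hour's reading predicts the current one
frame = pd.DataFrame({"target": series, "lag_1": series.shift(1)}).dropna()

# Chronological split so the model never sees the future
split = int(len(frame) * 0.75)
train, test = frame.iloc[:split], frame.iloc[split:]

# Baseline: predict that the reading does not change from the previous hour
persistence_mae = (test["target"] - test["lag_1"]).abs().mean()

# Simple model: straight-line fit of target on lag_1 using the training rows
slope, intercept = np.polyfit(train["lag_1"], train["target"], deg=1)
linear_pred = slope * test["lag_1"] + intercept
linear_mae = (test["target"] - linear_pred).abs().mean()

print(f"persistence MAE: {persistence_mae:.2f}")
print(f"linear MAE:      {linear_mae:.2f}")
```

Any more elaborate model would need to beat the persistence baseline on held-out data to be worth its added complexity.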



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

In [None]:
data = pd.read_csv('/content/city_attributes (1).csv')
data

Unnamed: 0,City,Country,Latitude,Longitude
0,Vancouver,Canada,49.24966,-123.119339
1,Portland,United States,45.523449,-122.676208
2,San Francisco,United States,37.774929,-122.419418
3,Seattle,United States,47.606209,-122.332069
4,Los Angeles,United States,34.052231,-118.243683
5,San Diego,United States,32.715328,-117.157257
6,Las Vegas,United States,36.174969,-115.137222
7,Phoenix,United States,33.44838,-112.074043
8,Albuquerque,United States,35.084492,-106.651138
9,Denver,United States,39.739151,-104.984703


# The **EDA** process for each dataset is as follows

*   Clean Data
*   Process and Analyze Data
*   Visualize data






In [None]:
# Data types
data.dtypes

City          object
Country       object
Latitude     float64
Longitude    float64
dtype: object

In [None]:
data.shape

(36, 4)

In [None]:
data.duplicated()

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
dtype: bool

In [None]:
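# Replace 'Israel' with 'Palestine' in the Country column.
# Assigning the result back avoids calling inplace=True on a column selection,
# which recent pandas versions warn about and which may not modify the DataFrame.
data['Country'] = data['Country'].replace('Israel', 'Palestine')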
data

Unnamed: 0,City,Country,Latitude,Longitude
0,Vancouver,Canada,49.24966,-123.119339
1,Portland,United States,45.523449,-122.676208
2,San Francisco,United States,37.774929,-122.419418
3,Seattle,United States,47.606209,-122.332069
4,Los Angeles,United States,34.052231,-118.243683
5,San Diego,United States,32.715328,-117.157257
6,Las Vegas,United States,36.174969,-115.137222
7,Phoenix,United States,33.44838,-112.074043
8,Albuquerque,United States,35.084492,-106.651138
9,Denver,United States,39.739151,-104.984703


In [None]:
# Remove columns (Latitude , Longitude)

data = data.drop(columns=['Latitude', 'Longitude'])
data

Unnamed: 0,City,Country
0,Vancouver,Canada
1,Portland,United States
2,San Francisco,United States
3,Seattle,United States
4,Los Angeles,United States
5,San Diego,United States
6,Las Vegas,United States
7,Phoenix,United States
8,Albuquerque,United States
9,Denver,United States


In [None]:
data['Country'].unique()

data['City'].unique()

array(['Vancouver', 'Portland', 'San Francisco', 'Seattle', 'Los Angeles',
       'San Diego', 'Las Vegas', 'Phoenix', 'Albuquerque', 'Denver',
       'San Antonio', 'Dallas', 'Houston', 'Kansas City', 'Minneapolis',
       'Saint Louis', 'Chicago', 'Nashville', 'Indianapolis', 'Atlanta',
       'Detroit', 'Jacksonville', 'Charlotte', 'Miami', 'Pittsburgh',
       'Toronto', 'Philadelphia', 'New York', 'Montreal', 'Boston',
       'Beersheba', 'Tel Aviv District', 'Eilat', 'Haifa', 'Nahariyya',
       'Jerusalem'], dtype=object)

In [None]:
data2 = pd.read_csv('/content/humidity.csv')
data2

Unnamed: 0,datetime,Vancouver,Portland,San Francisco,Seattle,Los Angeles,San Diego,Las Vegas,Phoenix,Albuquerque,...,Philadelphia,New York,Montreal,Boston,Beersheba,Tel Aviv District,Eilat,Haifa,Nahariyya,Jerusalem
0,2012-10-01 12:00:00,,,,,,,,,,...,,,,,,,25.0,,,
1,2012-10-01 13:00:00,76.0,81.0,88.0,81.0,88.0,82.0,22.0,23.0,50.0,...,71.0,58.0,93.0,68.0,50.0,63.0,22.0,51.0,51.0,50.0
2,2012-10-01 14:00:00,76.0,80.0,87.0,80.0,88.0,81.0,21.0,23.0,49.0,...,70.0,57.0,91.0,68.0,51.0,62.0,22.0,51.0,51.0,50.0
3,2012-10-01 15:00:00,76.0,80.0,86.0,80.0,88.0,81.0,21.0,23.0,49.0,...,70.0,57.0,87.0,68.0,51.0,62.0,22.0,51.0,51.0,50.0
4,2012-10-01 16:00:00,77.0,80.0,85.0,79.0,88.0,81.0,21.0,23.0,49.0,...,69.0,57.0,84.0,68.0,52.0,62.0,22.0,51.0,51.0,50.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26234,2015-09-29 14:00:00,100.0,93.0,77.0,87.0,82.0,83.0,33.0,44.0,39.0,...,73.0,69.0,100.0,83.0,52.0,62.0,23.0,37.0,37.0,62.0
26235,2015-09-29 15:00:00,100.0,87.0,77.0,87.0,83.0,78.0,21.0,34.0,34.0,...,78.0,74.0,88.0,88.0,52.0,69.0,24.0,54.0,54.0,58.0
26236,2015-09-29 16:00:00,93.0,76.0,80.0,82.0,73.0,72.0,19.0,33.0,33.0,...,75.0,65.0,83.0,85.0,39.0,65.0,30.0,54.0,54.0,62.0
26237,2015-09-29 17:00:00,87.0,60.0,86.0,75.0,57.0,62.0,17.0,33.0,32.0,...,72.0,78.0,83.0,80.0,67.0,65.0,33.0,100.0,100.0,70.0


In [None]:
data2.shape

data2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26239 entries, 0 to 26238
Data columns (total 37 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   datetime           26239 non-null  object 
 1   Vancouver          25250 non-null  float64
 2   Portland           25789 non-null  float64
 3   San Francisco      26089 non-null  float64
 4   Seattle            25949 non-null  float64
 5   Los Angeles        26094 non-null  float64
 6   San Diego          25936 non-null  float64
 7   Las Vegas          25585 non-null  float64
 8   Phoenix            24988 non-null  float64
 9   Albuquerque        25557 non-null  float64
 10  Denver             24433 non-null  float64
 11  San Antonio        25674 non-null  float64
 12  Dallas             25952 non-null  float64
 13  Houston            26118 non-null  float64
 14  Kansas City        25726 non-null  float64
 15  Minneapolis        25730 non-null  float64
 16  Saint Louis        252

In [None]:
data2.head(10)

Unnamed: 0,datetime,Vancouver,Portland,San Francisco,Seattle,Los Angeles,San Diego,Las Vegas,Phoenix,Albuquerque,...,Philadelphia,New York,Montreal,Boston,Beersheba,Tel Aviv District,Eilat,Haifa,Nahariyya,Jerusalem
0,2012-10-01 12:00:00,,,,,,,,,,...,,,,,,,25.0,,,
1,2012-10-01 13:00:00,76.0,81.0,88.0,81.0,88.0,82.0,22.0,23.0,50.0,...,71.0,58.0,93.0,68.0,50.0,63.0,22.0,51.0,51.0,50.0
2,2012-10-01 14:00:00,76.0,80.0,87.0,80.0,88.0,81.0,21.0,23.0,49.0,...,70.0,57.0,91.0,68.0,51.0,62.0,22.0,51.0,51.0,50.0
3,2012-10-01 15:00:00,76.0,80.0,86.0,80.0,88.0,81.0,21.0,23.0,49.0,...,70.0,57.0,87.0,68.0,51.0,62.0,22.0,51.0,51.0,50.0
4,2012-10-01 16:00:00,77.0,80.0,85.0,79.0,88.0,81.0,21.0,23.0,49.0,...,69.0,57.0,84.0,68.0,52.0,62.0,22.0,51.0,51.0,50.0
5,2012-10-01 17:00:00,78.0,79.0,84.0,79.0,88.0,80.0,21.0,24.0,49.0,...,69.0,57.0,80.0,68.0,54.0,62.0,23.0,51.0,51.0,50.0
6,2012-10-01 18:00:00,78.0,79.0,83.0,78.0,88.0,80.0,21.0,24.0,49.0,...,68.0,56.0,76.0,68.0,55.0,63.0,23.0,51.0,51.0,50.0
7,2012-10-01 19:00:00,79.0,78.0,82.0,77.0,88.0,80.0,21.0,24.0,49.0,...,68.0,56.0,72.0,68.0,56.0,63.0,23.0,51.0,51.0,50.0
8,2012-10-01 20:00:00,79.0,78.0,81.0,77.0,88.0,79.0,20.0,25.0,49.0,...,67.0,56.0,68.0,68.0,57.0,63.0,24.0,51.0,51.0,50.0
9,2012-10-01 21:00:00,80.0,77.0,80.0,76.0,88.0,79.0,20.0,25.0,49.0,...,67.0,55.0,64.0,68.0,58.0,64.0,24.0,51.0,51.0,50.0


In [None]:
data2.duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
26234    False
26235    False
26236    False
26237    False
26238    False
Length: 26239, dtype: bool

In [None]:
data2.isnull().sum()

datetime                0
Vancouver             989
Portland              450
San Francisco         150
Seattle               290
Los Angeles           145
San Diego             303
Las Vegas             654
Phoenix              1251
Albuquerque           682
Denver               1806
San Antonio           565
Dallas                287
Houston               121
Kansas City           513
Minneapolis           509
Saint Louis           996
Chicago              1109
Nashville             568
Indianapolis          581
Atlanta               404
Detroit               863
Jacksonville          210
Charlotte             392
Miami                 248
Pittsburgh            511
Toronto               728
Philadelphia          608
New York              833
Montreal             1697
Boston                450
Beersheba              37
Tel Aviv District     308
Eilat                 164
Haifa                   7
Nahariyya               6
Jerusalem             101
dtype: int64

In [None]:
# we want to remove null values

data2.dropna(inplace=True)

In [None]:
data2.isnull().sum()

datetime             0
Vancouver            0
Portland             0
San Francisco        0
Seattle              0
Los Angeles          0
San Diego            0
Las Vegas            0
Phoenix              0
Albuquerque          0
Denver               0
San Antonio          0
Dallas               0
Houston              0
Kansas City          0
Minneapolis          0
Saint Louis          0
Chicago              0
Nashville            0
Indianapolis         0
Atlanta              0
Detroit              0
Jacksonville         0
Charlotte            0
Miami                0
Pittsburgh           0
Toronto              0
Philadelphia         0
New York             0
Montreal             0
Boston               0
Beersheba            0
Tel Aviv District    0
Eilat                0
Haifa                0
Nahariyya            0
Jerusalem            0
dtype: int64

In [None]:
data2.columns

Index(['datetime', 'Vancouver', 'Portland', 'San Francisco', 'Seattle',
       'Los Angeles', 'San Diego', 'Las Vegas', 'Phoenix', 'Albuquerque',
       'Denver', 'San Antonio', 'Dallas', 'Houston', 'Kansas City',
       'Minneapolis', 'Saint Louis', 'Chicago', 'Nashville', 'Indianapolis',
       'Atlanta', 'Detroit', 'Jacksonville', 'Charlotte', 'Miami',
       'Pittsburgh', 'Toronto', 'Philadelphia', 'New York', 'Montreal',
       'Boston', 'Beersheba', 'Tel Aviv District', 'Eilat', 'Haifa',
       'Nahariyya', 'Jerusalem'],
      dtype='object')

In [None]:
data2.head()

Unnamed: 0,datetime,Vancouver,Portland,San Francisco,Seattle,Los Angeles,San Diego,Las Vegas,Phoenix,Albuquerque,...,Philadelphia,New York,Montreal,Boston,Beersheba,Tel Aviv District,Eilat,Haifa,Nahariyya,Jerusalem
1,2012-10-01 13:00:00,76.0,81.0,88.0,81.0,88.0,82.0,22.0,23.0,50.0,...,71.0,58.0,93.0,68.0,50.0,63.0,22.0,51.0,51.0,50.0
2,2012-10-01 14:00:00,76.0,80.0,87.0,80.0,88.0,81.0,21.0,23.0,49.0,...,70.0,57.0,91.0,68.0,51.0,62.0,22.0,51.0,51.0,50.0
3,2012-10-01 15:00:00,76.0,80.0,86.0,80.0,88.0,81.0,21.0,23.0,49.0,...,70.0,57.0,87.0,68.0,51.0,62.0,22.0,51.0,51.0,50.0
4,2012-10-01 16:00:00,77.0,80.0,85.0,79.0,88.0,81.0,21.0,23.0,49.0,...,69.0,57.0,84.0,68.0,52.0,62.0,22.0,51.0,51.0,50.0
5,2012-10-01 17:00:00,78.0,79.0,84.0,79.0,88.0,80.0,21.0,24.0,49.0,...,69.0,57.0,80.0,68.0,54.0,62.0,23.0,51.0,51.0,50.0


In [None]:
# Convert the 'datetime' column from object (string) to datetime

data2['datetime'] = pd.to_datetime(data2['datetime'])
data2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18898 entries, 1 to 26237
Data columns (total 37 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   datetime           18898 non-null  datetime64[ns]
 1   Vancouver          18898 non-null  float64       
 2   Portland           18898 non-null  float64       
 3   San Francisco      18898 non-null  float64       
 4   Seattle            18898 non-null  float64       
 5   Los Angeles        18898 non-null  float64       
 6   San Diego          18898 non-null  float64       
 7   Las Vegas          18898 non-null  float64       
 8   Phoenix            18898 non-null  float64       
 9   Albuquerque        18898 non-null  float64       
 10  Denver             18898 non-null  float64       
 11  San Antonio        18898 non-null  float64       
 12  Dallas             18898 non-null  float64       
 13  Houston            18898 non-null  float64       
 14  Kansas

In [None]:
data2.nlargest(1,'Vancouver')

Unnamed: 0,datetime,Vancouver,Portland,San Francisco,Seattle,Los Angeles,San Diego,Las Vegas,Phoenix,Albuquerque,...,Philadelphia,New York,Montreal,Boston,Beersheba,Tel Aviv District,Eilat,Haifa,Nahariyya,Jerusalem
361,2012-10-16 13:00:00,100.0,87.0,93.0,87.0,63.0,52.0,71.0,34.0,49.0,...,72.0,77.0,82.0,77.0,54.0,69.0,24.0,69.0,69.0,47.0
