
Pandas is a powerful, widely used open-source Python library for data manipulation and analysis. It is built on top of NumPy and provides easy-to-use data structures and tools for working with structured data.

Key Features:

1) Data Structures:

* Series: A one-dimensional labeled array capable of holding any data type (e.g., integers, floats, strings).
* DataFrame: A two-dimensional labeled data structure, similar to a spreadsheet or SQL table, with rows and columns.

2) Data Manipulation:

* Supports operations like filtering, grouping, and aggregating data.
* Allows reshaping, merging, and joining datasets.

3) File Handling:

* Can read/write data from/to various file formats like CSV, Excel, SQL databases, JSON, etc.

4) Data Cleaning:

* Offers tools to handle missing values, duplicates, and incorrect data.
* Provides robust methods for transforming and reformatting data.

5) Data Analysis:

* Allows statistical analysis, descriptive statistics, and applying custom functions to data.
* Facilitates time-series analysis with date-time indexing.

6) Visualization:

* Integrates seamlessly with libraries like Matplotlib and Seaborn to create plots and graphs.
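As a rough illustration of the structures and operations listed above, the sketch below builds a Series and a DataFrame by hand, then runs one filter and one group-by aggregation. The sample values mirror the weather data used later in this notebook; the variable names (`temps`, `frame`, `rainy`, `mean_by_condition`) are only illustrative.

```python
import pandas

# A Series: one-dimensional, labeled values
temps = pandas.Series([12, 14, 15], index=['Monday', 'Tuesday', 'Wednesday'])
print(temps['Tuesday'])

# A DataFrame: two-dimensional, one labeled column per key
frame = pandas.DataFrame({
    'day': ['Monday', 'Tuesday', 'Wednesday', 'Thursday'],
    'temp': [12, 14, 15, 14],
    'condition': ['Sunny', 'Rain', 'Rain', 'Cloudy'],
})

# Filtering: keep only the rainy days
rainy = frame[frame['condition'] == 'Rain']
print(rainy)

# Grouping and aggregating: mean temperature per condition
mean_by_condition = frame.groupby('condition')['temp'].mean()
print(mean_by_condition)
```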

Why Use Pandas?

Pandas is particularly useful for:

* Quickly exploring large datasets.
* Performing complex data wrangling and transformation tasks with minimal code.
* Preparing data for machine learning and statistical modeling.

In [42]:
with open('/content/weather_data.csv') as data_file:
  data = data_file.readlines()
  print(data)

['day,temp,condition\n', 'Monday,12,Sunny\n', 'Tuesday,14,Rain\n', 'Wednesday,15,Rain\n', 'Thursday,14,Cloudy\n', 'Friday,21,Sunny\n', 'Saturday,22,Sunny\n', 'Sunday,24,Sunny']


The readlines() method reads every line of a file and returns them as a list of strings. Trailing newline characters are kept, as the output above shows.

Each item here is one raw row of the file, so extracting usable values from it would take a lot of manual cleaning.
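To give a sense of the cleaning involved, here is a minimal sketch that parses the first three lines by hand. The `lines` list is copied from the output above rather than read from disk, so the snippet runs on its own.

```python
# Lines as returned by readlines(): the first three rows of the output above
lines = ['day,temp,condition\n', 'Monday,12,Sunny\n', 'Tuesday,14,Rain\n']

# Strip the trailing newline and split each line on commas
rows = [line.strip().split(',') for line in lines]
header, records = rows[0], rows[1:]

# Every field is still a string, so numbers need an explicit conversion
temperatures = [int(record[1]) for record in records]
print(header)
print(temperatures)
```

Even for this tiny file, the header has to be separated, whitespace stripped, and types converted manually.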

In [2]:
import csv

In [3]:
with open('weather_data.csv') as data_file:
  data = csv.reader(data_file)
  temperatures = []
  for row in data:
    if row[1] != 'temp':
      temperatures.append(int(row[1]))
  print(temperatures)

[12, 14, 15, 14, 21, 22, 24]
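The header check inside the loop above can be avoided with `csv.DictReader`, which treats the first row as field names. The following is a sketch under the assumption that the file has the same three columns; `io.StringIO` stands in for the open file so the example is self-contained.

```python
import csv
import io

# Stand-in for the open file: the first rows of weather_data.csv
sample = io.StringIO('day,temp,condition\nMonday,12,Sunny\nTuesday,14,Rain\n')

# DictReader uses the header row as keys, so no manual header check is needed
reader = csv.DictReader(sample)
temperatures = [int(row['temp']) for row in reader]
print(temperatures)
```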


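The csv module handles the splitting, but the header row still has to be skipped and each value converted by hand. For more complex data, the Pandas library is a better fit.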

In [4]:
import pandas

In [5]:
df = pandas.read_csv('weather_data.csv')
print(df)

         day  temp condition
0     Monday    12     Sunny
1    Tuesday    14      Rain
2  Wednesday    15      Rain
3   Thursday    14    Cloudy
4     Friday    21     Sunny
5   Saturday    22     Sunny
6     Sunday    24     Sunny
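Once a file is loaded, a few attributes give a quick overview of what pandas inferred. The sketch below reads the same rows from an in-memory string (an assumption made so it runs without the CSV file) and inspects the shape, column types, and first rows.

```python
import io
import pandas

# Same content as weather_data.csv, held in memory for the example
csv_text = """day,temp,condition
Monday,12,Sunny
Tuesday,14,Rain
Wednesday,15,Rain
Thursday,14,Cloudy
Friday,21,Sunny
Saturday,22,Sunny
Sunday,24,Sunny"""

weather = pandas.read_csv(io.StringIO(csv_text))

print(weather.shape)    # (number of rows, number of columns)
print(weather.dtypes)   # the type pandas inferred for each column
print(weather.head(3))  # the first three rows
```

Note that the `temp` column is parsed as integers automatically, with no manual conversion.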


In [6]:
print(type(df))

<class 'pandas.core.frame.DataFrame'>


In [7]:
# Convert the df into a dictionary

data_dict = df.to_dict()
print(data_dict)

{'day': {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}, 'temp': {0: 12, 1: 14, 2: 15, 3: 14, 4: 21, 5: 22, 6: 24}, 'condition': {0: 'Sunny', 1: 'Rain', 2: 'Rain', 3: 'Cloudy', 4: 'Sunny', 5: 'Sunny', 6: 'Sunny'}}
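The default `to_dict()` output is keyed by column and then by index. When one dictionary per row is more convenient, the `orient='records'` option can be used instead. A small sketch with two rows of the weather data (the name `weather` is illustrative):

```python
import pandas

weather = pandas.DataFrame({
    'day': ['Monday', 'Tuesday'],
    'temp': [12, 14],
    'condition': ['Sunny', 'Rain'],
})

# One dictionary per row instead of one dictionary per column
records = weather.to_dict(orient='records')
print(records)
```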


In [8]:
# Convert the temp column into a list

temp_list = df['temp'].to_list()
print(len(temp_list))

7


In [9]:
# This is one method of calculating mean
avg = sum(temp_list)/len(temp_list)
print(avg)

17.428571428571427


In [10]:
df['temp'].mean()  # The same mean, computed by pandas

17.428571428571427

In [11]:
max_temp = df['temp'].max()  # Largest value in the temp column
print(max_temp)

24
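Several statistics can also be requested in one call. The sketch below rebuilds the temperature column as a standalone Series (an assumption, so that no CSV file is needed) and uses `agg` to compute a few of them together; `idxmax` returns the index label of the largest value.

```python
import pandas

temps = pandas.Series([12, 14, 15, 14, 21, 22, 24], name='temp')

# Compute several statistics at once
summary = temps.agg(['min', 'max', 'mean', 'median'])
print(summary)

# Index label of the row holding the maximum temperature
print(temps.idxmax())
```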


In [12]:
# Select a single column by name
print(df['condition'])

0     Sunny
1      Rain
2      Rain
3    Cloudy
4     Sunny
5     Sunny
6     Sunny
Name: condition, dtype: object


In [13]:
# Alternative: access the column as an attribute
print(df.day)

0       Monday
1      Tuesday
2    Wednesday
3     Thursday
4       Friday
5     Saturday
6       Sunday
Name: day, dtype: object
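Attribute access is convenient but does not work for every column. The sketch below uses a small made-up DataFrame (the values and the `count` column are hypothetical) to show two cases where bracket access is needed: a column name containing spaces, and a name that clashes with an existing DataFrame method.

```python
import pandas

# Hypothetical data for illustration only
squirrels = pandas.DataFrame({
    'Primary Fur Color': ['Gray', 'Cinnamon'],  # name contains spaces
    'count': [2, 1],                            # clashes with DataFrame.count()
})

# Bracket access works for any column name
print(squirrels['Primary Fur Color'])
print(squirrels['count'])

# Attribute access finds the DataFrame method here, not the column
print(callable(squirrels.count))
```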


In [14]:
# Select rows by filtering on a column value
print(df[df.day == 'Monday'])

      day  temp condition
0  Monday    12     Sunny


In [15]:
print(df[df.day == 'Monday'])
print(df[df.temp == df.temp.max()])

      day  temp condition
0  Monday    12     Sunny
      day  temp condition
6  Sunday    24     Sunny


In [16]:
monday = df[df.day == 'Monday']
print(monday)
print(monday.condition)

      day  temp condition
0  Monday    12     Sunny
0    Sunny
Name: condition, dtype: object
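Row filters can be combined. The following sketch rebuilds the weather table in memory (so it runs without the CSV file) and shows two patterns: joining conditions with `&`, and using `.loc` to filter rows and pick a column in one step. The variable names are illustrative.

```python
import pandas

weather = pandas.DataFrame({
    'day': ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
            'Friday', 'Saturday', 'Sunday'],
    'temp': [12, 14, 15, 14, 21, 22, 24],
    'condition': ['Sunny', 'Rain', 'Rain', 'Cloudy',
                  'Sunny', 'Sunny', 'Sunny'],
})

# Each condition needs its own parentheses when combined with & (and) or | (or)
warm_and_sunny = weather[(weather['temp'] > 20) & (weather['condition'] == 'Sunny')]
print(warm_and_sunny)

# .loc takes a row mask and a column name together
rainy_days = weather.loc[weather['condition'] == 'Rain', 'day']
print(rainy_days.to_list())
```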


In [17]:
df

         day  temp condition
0     Monday    12     Sunny
1    Tuesday    14      Rain
2  Wednesday    15      Rain
3   Thursday    14    Cloudy
4     Friday    21     Sunny
5   Saturday    22     Sunny
6     Sunday    24     Sunny


In [18]:
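# Convert Monday's temperature from Celsius to Fahrenheit
monday_temp = monday.temp.iloc[0]  # .iloc[0] selects the first row by position
monday_temp * 9/5 + 32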

53.6
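The same conversion can be applied to the whole column at once, since arithmetic on a Series works element by element. The sketch below also shows why positional access with `.iloc[0]` is safer than `[0]` on a filtered result: the Sunday row keeps its original index label 6. The column name `temp_f` is an assumption made for this example.

```python
import pandas

weather = pandas.DataFrame({
    'day': ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
            'Friday', 'Saturday', 'Sunday'],
    'temp': [12, 14, 15, 14, 21, 22, 24],
})

# Celsius to Fahrenheit for every row in one expression
weather['temp_f'] = weather['temp'] * 9 / 5 + 32
print(weather)

# The filtered row keeps index label 6, so select it by position instead
sunday = weather[weather['day'] == 'Sunday']
print(sunday['temp_f'].iloc[0])
```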

In [19]:
# Define a dictionary with one list per column

data = {
    'students' : ['Amy','James','Angela'],
    'scores' : [76,56,65]
}

In [20]:
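# Note: this builds the DataFrame from data_dict (the weather dictionary from In [7]),
# not from the students dictionary defined in the previous cell
data = pandas.DataFrame(data_dict)
print(data)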

         day  temp condition
0     Monday    12     Sunny
1    Tuesday    14      Rain
2  Wednesday    15      Rain
3   Thursday    14    Cloudy
4     Friday    21     Sunny
5   Saturday    22     Sunny
6     Sunday    24     Sunny
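For comparison, this is a sketch of what passing the students dictionary itself to the constructor would produce. The names `students` and `students_df` are chosen here to avoid reusing `data`.

```python
import pandas

students = {
    'students': ['Amy', 'James', 'Angela'],
    'scores': [76, 56, 65],
}

# Each key becomes a column; each list supplies that column's values
students_df = pandas.DataFrame(students)
print(students_df)
print(students_df['scores'].mean())
```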


In [21]:
# Write the DataFrame to a CSV file (the index is written as the first column by default)
data.to_csv('new_data.csv')
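Because the index is written by default, reading such a file back produces an extra `Unnamed: 0` column. The sketch below demonstrates this with in-memory text instead of a file, and shows the `index=False` option that leaves the index out. The two-row `frame` is an illustrative stand-in.

```python
import io
import pandas

frame = pandas.DataFrame({'day': ['Monday', 'Tuesday'], 'temp': [12, 14]})

# With no path argument, to_csv returns the CSV text instead of writing a file
with_index = frame.to_csv()
without_index = frame.to_csv(index=False)
print(with_index)
print(without_index)

# Reading the indexed version back yields an unnamed first column
reloaded = pandas.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))
```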

### Squirrel Data Set

In [22]:
squirrel_data = pandas.read_csv('/content/2018_Central_Park_Squirrel_Census_-_Squirrel_Data.csv')
squirrel_data.head(2)

Unnamed: 0,X,Y,Unique Squirrel ID,Hectare,Shift,Date,Hectare Squirrel Number,Age,Primary Fur Color,Highlight Fur Color,...,Kuks,Quaas,Moans,Tail flags,Tail twitches,Approaches,Indifferent,Runs from,Other Interactions,Lat/Long
0,-73.956134,40.794082,37F-PM-1014-03,37F,PM,10142018,3,,,,...,False,False,False,False,False,False,False,False,,POINT (-73.9561344937861 40.7940823884086)
1,-73.957044,40.794851,37E-PM-1006-03,37E,PM,10062018,3,Adult,Gray,Cinnamon,...,False,False,False,False,False,False,False,True,me,POINT (-73.9570437717691 40.794850940803904)


In [26]:
grey_squirrel_count = len(squirrel_data[squirrel_data['Primary Fur Color'] == 'Gray'])
cinnamon_squirrel_count = len(squirrel_data[squirrel_data['Primary Fur Color'] == 'Cinnamon'])
black_squirrel_count = len(squirrel_data[squirrel_data['Primary Fur Color'] == 'Black'])

print(grey_squirrel_count)
print(cinnamon_squirrel_count)
print(black_squirrel_count)

2473
392
103


In [27]:
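# Use a new name so the weather DataFrame stored in df is not overwritten
fur_color_counts = {
    'Fur Color' : ['Gray','Cinnamon','Black'],
    'Count' : [grey_squirrel_count, cinnamon_squirrel_count, black_squirrel_count]
}

In [28]:
pandas.DataFrame(fur_color_counts)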

Unnamed: 0,Fur Color,Count
0,Gray,2473
1,Cinnamon,392
2,Black,103
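The three separate counts can also be produced in one step with `value_counts()`, which tallies each distinct value in a column and skips missing values by default. The sketch below uses a small hypothetical sample rather than the real census file, so the numbers differ from those above.

```python
import pandas

# Hypothetical miniature of the squirrel data; None marks a missing color
sample = pandas.DataFrame({
    'Primary Fur Color': ['Gray', 'Cinnamon', 'Gray', None, 'Black', 'Gray'],
})

# Count each distinct color, most frequent first
counts = sample['Primary Fur Color'].value_counts()
print(counts)

# Turn the counts into a two-column DataFrame like the one built by hand above
fur_color_table = counts.rename_axis('Fur Color').reset_index(name='Count')
print(fur_color_table)
```

This approach does not require knowing the list of colors in advance.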
