### Split-Apply-Combine

Imagine you have a dataset with sales information for different regions and you want to compute the average sales per region.

- Split: divide the dataset into groups based on the "Region" column.
- Apply: compute the average sales for each group.
- Combine: merge the results into a new DataFrame that shows the average sales per region.
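
The three steps can be sketched on a tiny table. This is only an illustration: the DataFrame `sales` and its column names `Region` and `Sales` are invented for the example and are not part of the datasets used below.

```python
import pandas as pd

# Hypothetical toy data; "Region" and "Sales" are assumed column names.
sales = pd.DataFrame({
    "Region": ["North", "North", "South", "South", "South"],
    "Sales": [100.0, 200.0, 50.0, 150.0, 100.0],
})

# Split: one group per distinct value of "Region".
groups = sales.groupby("Region")

# Apply + Combine: take the mean of each group and collect
# the per-group results into a new DataFrame.
mean_sales = groups["Sales"].mean().reset_index()
print(mean_sales)
```

In pandas the apply and combine steps usually happen in a single call: the aggregation method runs once per group and returns the combined result.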

Questions to analyze:
- Which cities are the most and the least expensive?
- Which cities have the largest and the smallest homes?
- Which bedroom/bathroom combinations are the most common?

In [2]:
import pandas as pd
import numpy as np

In [3]:
realtor_data = pd.read_csv("data+files/realtor-data.csv", parse_dates=["prev_sold_date"])
restaurants = pd.read_csv("data+files/california_restaurants.csv")

In [4]:
realtor_data.head(2)

Unnamed: 0,status,bed,bath,acre_lot,city,state,zip_code,house_size,prev_sold_date,price
0,for_sale,3.0,2.0,0.12,Adjuntas,Puerto Rico,601.0,920.0,NaT,105000.0
1,for_sale,4.0,2.0,0.08,Adjuntas,Puerto Rico,601.0,1527.0,NaT,80000.0


In [5]:

grp = realtor_data.groupby("city")
grp

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11bf32a90>

In [6]:
for city, data in grp:
    print(city)
    print(data.shape)

Abbot
(15, 10)
Aberdeen
(201, 10)
Abington
(179, 10)
Absecon
(353, 10)
Absecon Highlands
(4, 10)
Accord
(5, 10)
Acton
(851, 10)
Acushnet
(97, 10)
Acworth
(131, 10)
Adams
(870, 10)
Adamstown Township
(7, 10)
Addisleigh Park
(27, 10)
Addison
(125, 10)
Adelphia
(3, 10)
Adjuntas
(33, 10)
Agawam
(872, 10)
Aguada
(261, 10)
Aguadilla
(188, 10)
Aguas Buenas
(82, 10)
Aibonito
(102, 10)
Airmont
(30, 10)
Albany
(71, 10)
Albertson
(5, 10)
Albion
(4, 10)
Alburgh
(75, 10)
Aldan
(43, 10)
Alexander
(28, 10)
Alexandria
(109, 10)
Alexandria Township
(27, 10)
Alford
(77, 10)
Alfred
(13, 10)
Allagash
(1, 10)
Allamuchy Township
(160, 10)
Allendale
(128, 10)
Allenhurst
(22, 10)
Allenstown
(96, 10)
Allentown
(184, 10)
Allenwood
(21, 10)
Alloway
(66, 10)
Allston
(30, 10)
Alna
(1, 10)
Alpha
(19, 10)
Alpine
(199, 10)
Alstead
(235, 10)
Alton
(551, 10)
Amawalk
(2, 10)
Ambler
(5, 10)
Amenia
(73, 10)
Amesbury
(484, 10)
Amherst
(1112, 10)
Amity
(10, 10)
Anasco
(53, 10)
Ancram
(56, 10)
Ancramdale
(6, 10)
Andover
(121
[output truncated]
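
Looping over every group prints thousands of lines. The same information is available directly from the GroupBy object. The sketch below uses a small invented table (`listings`, with made-up rows) rather than the real file, so the names and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature of the listings table (values chosen for illustration).
listings = pd.DataFrame({
    "city": ["Adjuntas", "Adjuntas", "Boston", "Winslow"],
    "price": [105000.0, 80000.0, 690000.0, 200000.0],
})
grp_demo = listings.groupby("city")

n_groups = grp_demo.ngroups                # number of distinct cities
group_sizes = grp_demo.size()              # rows per city, as a Series
adjuntas = grp_demo.get_group("Adjuntas")  # the rows belonging to one city

print(n_groups)
print(group_sizes)
print(adjuntas)
```

`size()` gives the row count per group in one call, which is the number the loop above prints as the first element of each shape.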

In [7]:
# create a new column that combines city and state
realtor_data["city_state"] = realtor_data["city"] + ", " + realtor_data["state"]
realtor_data.sample(3)

Unnamed: 0,status,bed,bath,acre_lot,city,state,zip_code,house_size,prev_sold_date,price,city_state
325202,for_sale,8.0,3.0,0.12,Winslow,Maine,4901.0,2600.0,NaT,200000.0,"Winslow, Maine"
144297,for_sale,2.0,1.0,,Boston,Massachusetts,2215.0,775.0,2007-10-04,690000.0,"Boston, Massachusetts"
402822,for_sale,4.0,4.0,0.61,Simsbury,Connecticut,6070.0,2391.0,2012-07-02,400000.0,"Simsbury, Connecticut"


In [8]:
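# as_index=False keeps the grouping key as a regular column in aggregated results;
# with the default (as_index=True) the key becomes the index, and aggregating a
# single column returns a Series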
grp = realtor_data.groupby("city_state", as_index=False)
for city, data in grp:
    print(city)
    print(data.shape)

Abbot, Maine
(15, 11)
Aberdeen, New Jersey
(201, 11)
Abington, Massachusetts
(166, 11)
Abington, Pennsylvania
(13, 11)
Absecon Highlands, New Jersey
(4, 11)
Absecon, New Jersey
(353, 11)
Accord, New York
(5, 11)
Acton, Maine
(53, 11)
Acton, Massachusetts
(798, 11)
Acushnet, Massachusetts
(97, 11)
Acworth, New Hampshire
(131, 11)
Adams, Massachusetts
(870, 11)
Adamstown Township, Maine
(7, 11)
Addisleigh Park, New York
(27, 11)
Addison, Maine
(93, 11)
Addison, Vermont
(32, 11)
Adelphia, New Jersey
(3, 11)
Adjuntas, Puerto Rico
(33, 11)
Agawam, Massachusetts
(872, 11)
Aguada, Puerto Rico
(261, 11)
Aguadilla, Puerto Rico
(188, 11)
Aguas Buenas, Puerto Rico
(82, 11)
Aibonito, Puerto Rico
(102, 11)
Airmont, New York
(30, 11)
Albany, New Hampshire
(28, 11)
Albany, Vermont
(43, 11)
Albertson, New York
(5, 11)
Albion, Maine
(4, 11)
Alburgh, Vermont
(75, 11)
Aldan, Pennsylvania
(43, 11)
Alexander, Maine
(28, 11)
Alexandria Township, New Jersey
(27, 11)
Alexandria, New Hampshire
(109, 11)
Alford
[output truncated]
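
Grouping by city name alone merges different places that share a name: "Abington" has 179 rows when grouped by city, but splits into 166 (Massachusetts) and 13 (Pennsylvania) once the state is included. Building a combined string column is one way to separate them. Another option, sketched here on invented rows, is to pass both columns to `groupby`. The table `demo` and its prices are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows: two different cities share the name "Abington".
demo = pd.DataFrame({
    "city": ["Abington", "Abington", "Abington", "Albany"],
    "state": ["Massachusetts", "Massachusetts", "Pennsylvania", "Vermont"],
    "price": [500000.0, 550000.0, 320000.0, 250000.0],
})

# Grouping by city alone pools both Abingtons into one group.
by_city = demo.groupby("city")["price"].median()

# Grouping by both columns keeps them apart, without a helper column.
by_city_state = demo.groupby(["city", "state"], as_index=False)["price"].median()

print(by_city)
print(by_city_state)
```

With two keys the result keeps `city` and `state` as separate columns, which can be convenient for later filtering by state.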

In [9]:
median_price_by_city = grp.price.median()
median_price_by_city.shape

(3061, 2)
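
The shape has two columns because the groups were created with `as_index=False`. A minimal comparison on invented data (the table `demo` is an assumption, not the real file) shows the difference between the two settings:

```python
import pandas as pd

# Hypothetical rows for illustration.
demo = pd.DataFrame({
    "city_state": ["Abbot, Maine", "Abbot, Maine", "York, Maine"],
    "price": [30000.0, 50000.0, 619000.0],
})

# Default (as_index=True): the key becomes the index and the result is a Series.
as_series = demo.groupby("city_state")["price"].median()

# as_index=False: the key stays a regular column and the result is a DataFrame.
as_frame = demo.groupby("city_state", as_index=False)["price"].median()

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```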

In [10]:
median_price_by_city

Unnamed: 0,city_state,price
0,"Abbot, Maine",39900.0
1,"Aberdeen, New Jersey",499900.0
2,"Abington, Massachusetts",525000.0
3,"Abington, Pennsylvania",320000.0
4,"Absecon Highlands, New Jersey",89000.0
...,...,...
3056,"Yauco, Puerto Rico",70500.0
3057,"Yeadon, Pennsylvania",215000.0
3058,"Yonkers, New York",299900.0
3059,"York, Maine",619000.0


In [11]:
median_price_by_city = grp.price.agg(['median', 'max', 'min', 'mean','std', "count"])
median_price_by_city

Unnamed: 0,city_state,median,max,min,mean,std,count
0,"Abbot, Maine",39900.0,425000.0,21900.0,200013.333333,1.922600e+05,15
1,"Aberdeen, New Jersey",499900.0,849900.0,49000.0,452762.691542,1.613619e+05,201
2,"Abington, Massachusetts",525000.0,7900000.0,250000.0,821046.385542,1.498363e+06,166
3,"Abington, Pennsylvania",320000.0,399440.0,229000.0,306587.692308,5.231031e+04,13
4,"Absecon Highlands, New Jersey",89000.0,89000.0,89000.0,89000.000000,0.000000e+00,4
...,...,...,...,...,...,...,...
3056,"Yauco, Puerto Rico",70500.0,4000000.0,32500.0,151775.925926,3.914199e+05,108
3057,"Yeadon, Pennsylvania",215000.0,280000.0,25000.0,208861.538462,7.927538e+04,13
3058,"Yonkers, New York",299900.0,2199000.0,44900.0,421251.046487,3.096938e+05,1893
3059,"York, Maine",619000.0,5950000.0,21000.0,929687.190698,1.222300e+06,430
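
Passing a list of function names to `agg` names the output columns after the functions. If more descriptive column names are wanted, pandas also supports named aggregation. The following is a sketch on invented rows; the output names `median_price`, `max_price` and `listings` are arbitrary choices for the example.

```python
import pandas as pd

# Hypothetical rows for illustration.
demo = pd.DataFrame({
    "city_state": ["Abbot, Maine", "Abbot, Maine", "York, Maine", "York, Maine"],
    "price": [21900.0, 57900.0, 21000.0, 619000.0],
})

# Named aggregation: output_name=(input_column, function).
price_stats = (
    demo.groupby("city_state", as_index=False)
    .agg(
        median_price=("price", "median"),
        max_price=("price", "max"),
        listings=("price", "count"),
    )
)
print(price_stats)
```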


In [12]:
median_price_by_city.sort_values("median", ascending=False, inplace=True)

In [13]:
median_price_by_city

Unnamed: 0,city_state,median,max,min,mean,std,count
2821,"Waterfront, Massachusetts",12000000.0,12000000.0,12000000.0,1.200000e+07,0.000000e+00,25
2243,"Rochdale Village, New York",9800000.0,9800000.0,9800000.0,9.800000e+06,0.000000e+00,3
1628,"Middletown Township, New Jersey",9199999.5,17500000.0,899999.0,9.200000e+06,9.092195e+06,6
2424,"Siasconset, Massachusetts",6495000.0,6495000.0,6495000.0,6.495000e+06,0.000000e+00,3
2650,"Tisbury, Massachusetts",5750000.0,11600000.0,699000.0,6.406333e+06,4.828756e+06,15
...,...,...,...,...,...,...,...
1930,"Oakfield, Maine",14000.0,30000.0,14000.0,1.931667e+04,7.853295e+03,12
1969,"Orient, Maine",12500.0,12500.0,12500.0,1.250000e+04,,1
1398,"Lehman, Pennsylvania",12500.0,17500.0,9900.0,1.310909e+04,2.700909e+03,11
1377,"Laureldale, New Jersey",9900.0,279900.0,9900.0,3.066923e+04,7.488453e+04,13
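
Both ends of this ranking are dominated by groups with very few listings (counts of 1, 3 or 6), so the medians there rest on little data. One possible refinement, shown as a sketch, is to require a minimum number of listings before ranking. The four rows below are copied from the outputs above; the threshold `MIN_LISTINGS = 10` is an assumption and should be tuned to the dataset.

```python
import pandas as pd

# A few rows taken from the summary tables above.
stats = pd.DataFrame({
    "city_state": [
        "Rochdale Village, New York",
        "Abington, Massachusetts",
        "Yonkers, New York",
        "Orient, Maine",
    ],
    "median": [9800000.0, 525000.0, 299900.0, 12500.0],
    "count": [3, 166, 1893, 1],
})

MIN_LISTINGS = 10  # assumed threshold, not taken from the source
reliable = stats[stats["count"] >= MIN_LISTINGS]

most_expensive = reliable.nlargest(1, "median")
least_expensive = reliable.nsmallest(1, "median")
print(most_expensive)
print(least_expensive)
```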


In [14]:
home_sizes_by_city = (grp.house_size
 .agg(['median', 'max', 'min', 'mean','std', "count"])
 .sort_values("median", ascending=False)
 )

In [15]:
home_sizes_by_city

Unnamed: 0,city_state,median,max,min,mean,std,count
2634,"Tenafly, New Jersey",47916.0,47916.0,47916.0,47916.000000,0.000000,8
962,"Garfield, New Jersey",19998.0,19998.0,1556.0,11885.848485,8959.957859,66
609,"Cresskill, New Jersey",19110.0,19110.0,19110.0,19110.000000,0.000000,11
1879,"North Haledon, New Jersey",17250.0,17250.0,4152.0,15794.666667,4366.000000,9
2243,"Rochdale Village, New York",15000.0,15000.0,15000.0,15000.000000,0.000000,3
...,...,...,...,...,...,...,...
2962,"Williamsburg, New York",,,,,,0
2991,"Windsor, New Hampshire",,,,,,0
2997,"Winslow Township, New Jersey",,,,,,0
3020,"Woodcliff Lake, New Jersey",,,,,,0
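
The rows at the bottom have a count of 0: every `house_size` value in those groups is missing, so all statistics are empty, and `sort_values` places missing values last. A sketch of filtering such groups out before ranking follows. The table `demo` is invented for illustration, and only the two Adjuntas sizes come from the `head` output above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the second city has no recorded house sizes.
demo = pd.DataFrame({
    "city_state": [
        "Adjuntas, Puerto Rico", "Adjuntas, Puerto Rico",
        "Williamsburg, New York", "Williamsburg, New York",
    ],
    "house_size": [920.0, 1527.0, np.nan, np.nan],
})

size_stats = (
    demo.groupby("city_state")["house_size"]
    .agg(["median", "count"])   # count ignores missing values
    .reset_index()
)

# Keep only groups that have at least one recorded size, then rank.
with_data = size_stats[size_stats["count"] > 0].sort_values("median", ascending=False)
print(with_data)
```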


In [16]:
counts = realtor_data.groupby(['bed','bath'], as_index=False)['city'].count().rename({"city":"count"}, axis=1)

In [17]:
counts = counts.sort_values('count',ascending=False).reset_index(drop=False)

In [18]:
# the same steps as the two cells above, rewritten as a single method chain
counts = (realtor_data
          .groupby(['bed','bath'], as_index=False)['city']
          .count()
          .rename({"city":"count"}, axis=1)
          .sort_values('count',ascending=False).reset_index(drop=False)
          )

In [19]:
counts

Unnamed: 0,index,bed,bath,count
0,15,3.0,2.0,114283
1,8,2.0,2.0,88319
2,0,1.0,1.0,71933
3,16,3.0,3.0,69601
4,7,2.0,1.0,57583
...,...,...,...,...
258,250,36.0,12.0,1
259,249,33.0,35.0,1
260,199,18.0,18.0,1
261,5,1.0,6.0,1


In [20]:
# simpler: value_counts gives the same ranking in one call
realtor_data[['bed', 'bath']].value_counts()

bed    bath 
3.0    2.0      114283
2.0    2.0       88319
1.0    1.0       71937
3.0    3.0       69606
2.0    1.0       57585
                 ...  
36.0   12.0          1
33.0   35.0          1
18.0   18.0          1
1.0    6.0           1
123.0  123.0         1
Name: count, Length: 263, dtype: int64
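
The two approaches give slightly different numbers (for example 71933 versus 71937 for one bed, one bath). A likely explanation is that `count()` on the `city` column skips rows where `city` is missing, whereas `value_counts` counts every row with a non-missing bed/bath pair. The sketch below reproduces the effect on invented rows; the table `demo` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical rows: one listing has no city recorded.
demo = pd.DataFrame({
    "bed": [1.0, 1.0, 1.0, 3.0],
    "bath": [1.0, 1.0, 1.0, 2.0],
    "city": ["Boston", None, "Winslow", "Simsbury"],
})

# count() counts non-missing values of the selected column.
non_null_city = demo.groupby(["bed", "bath"])["city"].count()

# size() counts rows per group, regardless of missing values in other columns.
rows_per_group = demo.groupby(["bed", "bath"]).size()

# value_counts on the key columns agrees with size().
via_value_counts = demo[["bed", "bath"]].value_counts()

print(non_null_city.loc[(1.0, 1.0)])
print(rows_per_group.loc[(1.0, 1.0)])
print(via_value_counts.loc[(1.0, 1.0)])
```

When the goal is to count rows per group, `size()` is the groupby counterpart of `value_counts`.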

In [21]:
restaurants.sample(2)

Unnamed: 0,Yelp URL,Name,Street Address,Zip Code,City,State,Price Range,Phone,Rating,Number of Reviews,Website,Menu Link,Image 1,Image 2,Image 3,Category 1,Category 2,Category 3
1089,https://www.yelp.com/biz/tasty-xian-diamond-bar,Tasty Xi'an,1155 S Diamond Bar Blvd Ste L,91765.0,Diamond Bar,CA,,(909) 396-9865,4.0,53.0,www.tastyxian.com,,https://s3-media0.fl.yelpcdn.com/bphoto/f9eA-g...,https://s3-media0.fl.yelpcdn.com/bphoto/YqppB0...,https://s3-media0.fl.yelpcdn.com/bphoto/1W5WjH...,Chinese,,
1108,https://www.yelp.com/biz/sun-nong-dan-rowland-...,Sun Nong Dan,18902 A E Gale Ave,91748.0,Rowland Heights,CA,$$,(626) 581-2233,4.0,1134.0,www.sunnongdan.net,www.sunnongdan.com/menu,https://s3-media0.fl.yelpcdn.com/bphoto/4iEoM_...,https://s3-media0.fl.yelpcdn.com/bphoto/2lqwTK...,https://s3-media0.fl.yelpcdn.com/bphoto/N1qpU8...,Korean,Soup,


In [26]:
rating_by_city = restaurants.groupby("City")["Rating"].agg(["mean", "median", "count", "std"]).sort_values("mean", ascending=False)

In [28]:
rating_by_city.loc[rating_by_city["count"]>=10]

City,mean,median,count,std
Chino,4.5,4.5,24,0.361158
Ontario,4.5,4.5,11,0.316228
Walnut,4.481481,4.5,27,0.448962
Garden Grove,4.428571,4.5,35,0.422577
Pomona,4.413793,4.5,29,0.464238
La Mirada,4.394737,4.5,19,0.657836
Buena Park,4.375,4.5,44,0.375484
West Covina,4.333333,4.5,39,0.331133
La Habra,4.333333,4.5,33,0.645497
Whittier,4.314286,4.5,35,0.299158


In [39]:
result = restaurants.groupby(["City", "Price Range"])["Rating"].agg(["mean", "median", "count", "std"]).sort_values("mean", ascending=False)

In [42]:
result[result["count"] >= 10]

City,Price Range,mean,median,count,std
Walnut,$$,4.470588,4.5,17,0.483173
Anaheim,$,4.409091,4.5,22,0.590326
La Habra,$$,4.386364,4.5,22,0.554888
Chino,$$,4.361111,4.5,18,0.287257
Fullerton,$,4.354167,4.5,24,0.453948
City of Industry,$$,4.318182,4.5,11,0.252262
Placentia,$$,4.309524,4.5,21,0.334522
Buena Park,$$,4.282609,4.5,23,0.331185
Whittier,$$,4.28,4.5,25,0.291548
Anaheim,$$,4.275,4.5,100,0.48915
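
Grouping by two keys without `as_index=False` produces a result indexed by a two-level MultiIndex (`City`, `Price Range`). A sketch of common ways to select from it or flatten it follows; the table `demo` and its ratings are invented for illustration.

```python
import pandas as pd

# Hypothetical rows for illustration.
demo = pd.DataFrame({
    "City": ["Anaheim", "Anaheim", "Anaheim", "Walnut"],
    "Price Range": ["$", "$", "$$", "$$"],
    "Rating": [4.0, 5.0, 4.5, 4.0],
})

by_city_price = demo.groupby(["City", "Price Range"])["Rating"].agg(["mean", "count"])

# All price ranges for one city (indexed by "Price Range").
anaheim = by_city_price.loc["Anaheim"]

# One price range across all cities (indexed by "City").
mid_priced = by_city_price.xs("$$", level="Price Range")

# Turn the index levels back into ordinary columns.
flat = by_city_price.reset_index()

print(anaheim)
print(mid_priced)
print(flat)
```

Rows whose `Price Range` is missing are excluded from the groups by default, because `groupby` drops missing keys unless `dropna=False` is passed.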
