First, store the paths to the input files in variables and create an RDD from each path.

In [2]:
accountsPath = "/loudacre/accounts/part-m-*"
weblogsPath = "/loudacre/weblogs/FlumeData.*"
baseStationsPath = "/loudacre/base_stations.tsv"

accountsRDD = sc.textFile(accountsPath)
weblogsRDD = sc.textFile(weblogsPath)
baseStationsRDD = sc.textFile(baseStationsPath)

How many users are there in total? Assuming each user has a distinct user ID (column 0), map each record to its user ID and count the distinct values.

In [3]:
totalUsers = accountsRDD.map(lambda line:(line.split(',')[0]))\
    .distinct()\
    .count()
print('there are ' + str(totalUsers) + ' unique users in total')

there are 129764 unique users in total
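
The map / distinct / count chain above can be checked without a cluster. The following is a minimal plain-Python sketch of the same logic. The sample lines and the helper name `count_distinct_field` are made up for illustration; the only assumption taken from the cell above is that the user ID is the first comma-separated field.

```python
# Hypothetical sample records; only the position of the ID field (column 0) matters.
sample_lines = [
    "101,Ann,Example",
    "102,Bob,Example",
    "102,Bob,Example",   # duplicate record for the same user ID
    "103,Cat,Example",
]

def count_distinct_field(lines, index, sep=","):
    """Plain-Python stand-in for rdd.map(split).distinct().count()."""
    # A set keeps one copy of each value, which is what distinct() does.
    return len({line.split(sep)[index] for line in lines})

total_users = count_distinct_field(sample_lines, 0)
print("there are " + str(total_users) + " unique users in total")
```

The duplicate record is counted once, which is why the count is taken after distinct() rather than directly on the mapped RDD.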


How many cities have users? In other words, how many unique cities appear in the user accounts records?
As above, map each record to its city and count the distinct values. The city is stored in column 6; the commented-out line is a sense check of that.

In [4]:
totalUserCities = accountsRDD.map(lambda line:(line.split(',')[6]))\
    .distinct()
    
#print(totalUserCities.take(100))

totalUserCities=totalUserCities.count()

print('there are ' + str(totalUserCities) + ' unique cities with users')

there are 56 unique cities with users


How many cities have base stations? In other words, how many unique cities are listed in the base stations records?
Map each record to its city column and count the distinct values. Note that this file is tab-separated, so split on '\t' rather than on a comma.

In [5]:
totalBaseStationCities = baseStationsRDD.map(lambda line:(line.split('\t')[2]))\
    .distinct()
#sense-check to make sure we have the correct column
#print(totalBaseStationCities.take(5))

totalBaseStationCities=totalBaseStationCities.count()

print('there are ' + str(totalBaseStationCities) + ' unique cities with base stations')


there are 238 unique cities with base stations
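
To illustrate why the separator matters for this file, here is a small plain-Python sketch. The sample record is hypothetical: it only assumes, as in the cell above, that fields are tab-separated and that the city is in column 2.

```python
# Hypothetical tab-separated record: station key, a placeholder field, city, state.
sample_line = "7\tplaceholder\tKlamath Falls\tOR"

fields_by_tab = sample_line.split("\t")
fields_by_comma = sample_line.split(",")

print(fields_by_tab[2])        # the city column
print(len(fields_by_comma))    # no commas in the line, so it comes back as one field
```

Splitting this line on a comma leaves a single field, so indexing column 2 would raise an IndexError instead of returning the city.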


How many base stations are there? Let's not assume the station keys (column 0) have been entered sequentially, even though it looks like they have been. Instead, count the number of unique keys.

In [6]:
totalBaseStations = baseStationsRDD.map(lambda line:(line.split('\t')[0]))\
    .distinct()\
    .count()
print('there are ' + str(totalBaseStations) + ' base stations in total')

there are 377 base stations in total


Number of users in each city
To do this, first create an RDD from the accounts data with city as the key and 1 as the value. We could call reduceByKey on this alone, but the result would not include cities with zero users (I am assuming we also want cities that have base stations but no users).
To include these, create an RDD from the base station data with city as the key and 0 as the value.
Union the two RDDs and call reduceByKey to get the number of users for each city.

The code below restricts both inputs to Oregon (state "OR"). All Oregon cities with users or base stations are shown, ordered by total number of users (descending).

In [7]:
#always filter first for efficiency
ORuserCityRDD = accountsRDD.filter(lambda line: line.split(',')[7] == "OR")\
    .map(lambda line:(line.split(',')[6],1))
   
ORcityBaseStationRDD = baseStationsRDD.filter(lambda line: line.split('\t')[3]=="OR")\
    .map(lambda line:(line.split('\t')[2],0))

ORuserCityRDD=ORuserCityRDD.union(ORcityBaseStationRDD)\
    .reduceByKey(lambda v1,v2: v1+v2)

ORuserCityRDD.takeOrdered(100, key=lambda kv: -kv[1])

#TODO - print instead of take

[(u'Portland', 4602),
 (u'Bend', 1528),
 (u'Eugene', 1520),
 (u'Medford', 1511),
 (u'Salem', 1496),
 (u'Klamath Falls', 1463),
 (u'Pendleton', 1455),
 (u'Umatilla', 0),
 (u'Butte Falls', 0),
 (u'Saint Benedict', 0),
 (u'Beaver', 0),
 (u'Bridgeport', 0),
 (u'Gaston', 0),
 (u'Oregon City', 0),
 (u'Riverside', 0),
 (u'Government Camp', 0),
 (u'Wilbur', 0),
 (u'Bates', 0),
 (u'Trail', 0),
 (u'Long Creek', 0),
 (u'Bridal Veil', 0),
 (u'North Powder', 0),
 (u'Molalla', 0),
 (u'Corvallis', 0),
 (u'Halsey', 0),
 (u'Baker City', 0),
 (u'Dillard', 0),
 (u'Junction City', 0)]
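
The union-with-zeros idea can be sketched in plain Python as follows. The helper `reduce_by_key` and the sample pairs are hypothetical and only mimic what union() and reduceByKey do on the RDDs above.

```python
from collections import defaultdict

def reduce_by_key(pairs):
    """Sum the values for each key, like rdd.reduceByKey(lambda v1, v2: v1 + v2)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical (city, 1) pairs, one per user account.
user_pairs = [("Portland", 1), ("Portland", 1), ("Bend", 1)]
# Hypothetical (city, 0) pairs, one per base station.
station_pairs = [("Portland", 0), ("Umatilla", 0), ("Umatilla", 0)]

# List concatenation stands in for union().
users_per_city = reduce_by_key(user_pairs + station_pairs)

# Sort by user count, descending, like takeOrdered with a negated key.
ranked = sorted(users_per_city.items(), key=lambda kv: -kv[1])
print(ranked)
```

Because the base station pairs carry a value of 0, they add a key for every station city without changing any user total, so a city with stations but no users ends up with a count of 0.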

Count of users and base stations for each city
To achieve this, map both the accounts and base station RDDs to (city, 1) pairs, call reduceByKey on each to get the per-city counts, and join the results. An outer join is needed so that cities with users but no base station are kept.
The city is taken as user input (for example, 'Sacramento' provides the answer to the original question; 'Salem' is an example with zero base stations, giving 'None').

In [46]:
usersByCity = accountsRDD.map(lambda line:(line.split(',')[6],1))\
    .reduceByKey(lambda v1,v2: v1+v2)
    
baseStationsByCity = baseStationsRDD.map(lambda line:(line.split('\t')[2],1))\
    .reduceByKey(lambda v1,v2: v1+v2)

#join the above RDDs to get users and stations per city
countsByCity=usersByCity.fullOuterJoin(baseStationsByCity)\
    .filter(lambda kv: kv[1][0] is not None) #filter to only include cities with users


#take user input and return data for city
requestedCity=raw_input('please enter city: ')
cityRow = countsByCity.filter(lambda kv: kv[0].lower()==requestedCity.lower())

#take(1) returns a list; keep it in its own variable so cityRow remains an RDD
matches = cityRow.take(1)
if matches:
    for (city, values) in matches:
        print 'city:' , city, '|\ttotal users:', values[0], '\ttotal base stations:', values[1]

else:
    print 'city name not recognised'

please enter city: sacramento
city: Sacramento |	total users: 6820 	total base stations: 4
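
The join-and-lookup step can be sketched in plain Python to show where the 'None' comes from. The helper names `full_outer_join` and `lookup`, and the per-city counts, are hypothetical; the sketch only mirrors the behaviour of fullOuterJoin followed by the two filters above.

```python
def full_outer_join(left, right):
    """Join two dicts on key; a side with no entry contributes None, as in fullOuterJoin."""
    return {key: (left.get(key), right.get(key)) for key in set(left) | set(right)}

# Hypothetical per-city counts.
users_by_city = {"Sacramento": 3, "Salem": 2}
stations_by_city = {"Sacramento": 1, "Umatilla": 4}

joined = full_outer_join(users_by_city, stations_by_city)
# Keep only cities that have users, mirroring the first filter.
counts_by_city = {city: v for city, v in joined.items() if v[0] is not None}

def lookup(city_name):
    """Case-insensitive lookup; returns None when the city is not present."""
    wanted = city_name.lower()
    for city, values in counts_by_city.items():
        if city.lower() == wanted:
            return city, values
    return None

print(lookup("sacramento"))
print(lookup("salem"))      # users but no base stations: second value is None
print(lookup("umatilla"))   # base stations but no users: filtered out
```

Since cities without users are dropped after the join, a left outer join from the users side would give the same rows in this sketch; the full outer join is kept here only to match the code above.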

