The problem is caused by the link you are using to download Spark:

http://apache.osuosl.org/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz

To download Spark without problems, use the Apache archive site (https://archive.apache.org/dist/spark).

For example, the following download link from their archive website works fine:

https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz

Here is the complete code to install and set up Java, Spark, and PySpark:

# install java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# install spark (change the version number if needed)
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz

# unzip the spark file to the current folder
!tar xf spark-3.0.0-bin-hadoop3.2.tgz

# set the JAVA_HOME and SPARK_HOME environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"


# install findspark using pip
!pip install -q findspark

Python users should also install pyspark with the following command.

!pip install pyspark

Here is the link to this source: https://stackoverflow.com/questions/55240940/error-while-installing-spark-on-google-colab 
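The setup above says to change the version number if needed, which means keeping the download URL, the archive name, and SPARK_HOME in sync. A minimal sketch of that bookkeeping follows. The helper names are hypothetical, and it assumes the file naming pattern seen in the archive URL above, which not every Spark release necessarily follows.

```python
# Hypothetical helpers (not part of the original answer): derive the archive
# URL and the SPARK_HOME path from the Spark and Hadoop version strings.
ARCHIVE_ROOT = "https://archive.apache.org/dist/spark"


def spark_archive_name(spark_version, hadoop_version):
    """Return the release base name, e.g. spark-3.0.0-bin-hadoop3.2."""
    return f"spark-{spark_version}-bin-hadoop{hadoop_version}"


def spark_download_url(spark_version, hadoop_version):
    """Return the archive URL of the .tgz file for this release."""
    name = spark_archive_name(spark_version, hadoop_version)
    return f"{ARCHIVE_ROOT}/spark-{spark_version}/{name}.tgz"


def spark_home(spark_version, hadoop_version, base_dir="/content"):
    """Return the folder created by extracting the archive into base_dir."""
    return f"{base_dir}/{spark_archive_name(spark_version, hadoop_version)}"


url = spark_download_url("3.0.0", "3.2")
home = spark_home("3.0.0", "3.2")
print(url)
print(home)
```

The printed values can be pasted into the wget command and the SPARK_HOME assignment, so the three places cannot drift apart when the version changes.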

Install needed dependencies

In [None]:
# Python users should also install pyspark and findspark.
!pip install pyspark
!pip install findspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Import libraries and modules, create a Spark session, and connect to the local master node

In [None]:
import pandas as pd
import pyspark
import findspark
findspark.init()
from pyspark.sql import SparkSession
# create (or reuse) a single local session; local[*] uses all available cores
spark = SparkSession.builder\
        .master("local[*]")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()
spark

1. Display Top 10 Rows of The Dataset


In [None]:
data = pd.read_csv("/content/Ecommerce_purchases.csv")
df = spark.createDataFrame(data)
df.show(10)

+--------------------+-----+--------+--------------------+--------------------+----------------+-----------+----------------+--------------------+--------------------+--------------------+---------------+--------+--------------+
|             Address|  Lot|AM or PM|        Browser Info|             Company|     Credit Card|CC Exp Date|CC Security Code|         CC Provider|               Email|                 Job|     IP Address|Language|Purchase Price|
+--------------------+-----+--------+--------------------+--------------------+----------------+-----------+----------------+--------------------+--------------------+--------------------+---------------+--------+--------------+
|16629 Pace Camp A...|46 in|      PM|Opera/9.56.(X11; ...|     Martinez-Herman|6011929061123406|      02/20|             900|        JCB 16 digit|   pdunlap@yahoo.com|Scientist, produc...|149.146.147.205|      el|         98.14|
|9374 Jasmine Spur...|28 rn|      PM|Opera/8.93.(Windo...|Fletcher, Richard...|33377

2. Check Last 10 Rows of The Dataset

In [None]:
# Extract the last N rows of the DataFrame in PySpark
 
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.functions import desc
 
df2 = df.withColumn("index", monotonically_increasing_id())
df2.orderBy(desc("index")).drop("index").show(10)

+--------------------+-----+--------+--------------------+--------------------+----------------+-----------+----------------+----------------+--------------------+--------------------+---------------+--------+--------------+
|             Address|  Lot|AM or PM|        Browser Info|             Company|     Credit Card|CC Exp Date|CC Security Code|     CC Provider|               Email|                 Job|     IP Address|Language|Purchase Price|
+--------------------+-----+--------+--------------------+--------------------+----------------+-----------+----------------+----------------+--------------------+--------------------+---------------+--------+--------------+
|40674 Barrett Str...|64 Hr|      AM|Mozilla/5.0 (X11;...|          Greene Inc|4139972901927273|      02/19|             302|    JCB 15 digit|rachelford@vaughn...|Embryologist, cli...|176.119.198.199|      el|         67.59|
|0096 English Rest...|74 cL|      PM|Mozilla/5.0 (Maci...|            Cook Inc| 180003348082930|    

3. Check Datatype of Each Column

In [None]:
# Get all column names and their types
for field in df.schema.fields:
    print(field.name +" and its data type is: "+str(field.dataType))

Address and its data type is: StringType()
Lot and its data type is: StringType()
AM or PM and its data type is: StringType()
Browser Info and its data type is: StringType()
Company and its data type is: StringType()
Credit Card and its data type is: LongType()
CC Exp Date and its data type is: StringType()
CC Security Code and its data type is: LongType()
CC Provider and its data type is: StringType()
Email and its data type is: StringType()
Job and its data type is: StringType()
IP Address and its data type is: StringType()
Language and its data type is: StringType()
Purchase Price and its data type is: DoubleType()


In [None]:
### Get count of null values in pyspark
 
from pyspark.sql.functions import isnan, when, count, col
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

+-------+---+--------+------------+-------+-----------+-----------+----------------+-----------+-----+---+----------+--------+--------------+
|Address|Lot|AM or PM|Browser Info|Company|Credit Card|CC Exp Date|CC Security Code|CC Provider|Email|Job|IP Address|Language|Purchase Price|
+-------+---+--------+------------+-------+-----------+-----------+----------------+-----------+-----+---+----------+--------+--------------+
|      0|  0|       0|           0|      0|          0|          0|               0|          0|    0|  0|         0|       0|             0|
+-------+---+--------+------------+-------+-----------+-----------+----------------+-----------+-----+---+----------+--------+--------------+



Pandas Profile even reported no missing/null values.

5. How many rows and columns are there in our Dataset?

In [None]:
print("The number of Rows: ", df.count())
print("The number of Columns: ", len(df.columns))

The number of Rows:  10000
The number of Columns:  14


6. Show all columns, then find the length of the values in each column.

In [None]:
from pyspark.sql.functions import length
# show() returns None, so display each result directly instead of assigning it
df.withColumn('Address_length', length(df.Address)).show()
df.withColumn('Lot_length', length(df.Lot)).show()
df.withColumn('AM or PM_length', length(df['AM or PM'])).show()
df.withColumn('Browser Info_length', length(df['Browser Info'])).show()
df.withColumn('Company_length', length(df['Company'])).show()
df.withColumn('Credit Card_length', length(df['Credit Card'])).show()
df.withColumn('CC Exp Date_length', length(df['CC Exp Date'])).show()

+--------------------+-----+--------+--------------------+--------------------+----------------+-----------+----------------+--------------------+--------------------+--------------------+---------------+--------+--------------+--------------+
|             Address|  Lot|AM or PM|        Browser Info|             Company|     Credit Card|CC Exp Date|CC Security Code|         CC Provider|               Email|                 Job|     IP Address|Language|Purchase Price|Address_length|
+--------------------+-----+--------+--------------------+--------------------+----------------+-----------+----------------+--------------------+--------------------+--------------------+---------------+--------+--------------+--------------+
|16629 Pace Camp A...|46 in|      PM|Opera/9.56.(X11; ...|     Martinez-Herman|6011929061123406|      02/20|             900|        JCB 16 digit|   pdunlap@yahoo.com|Scientist, produc...|149.146.147.205|      el|         98.14|            54|
|9374 Jasmine Spur...|28

In [None]:
df.count()

10000

In [None]:
df.printSchema()

root
 |-- Address: string (nullable = true)
 |-- Lot: string (nullable = true)
 |-- AM or PM: string (nullable = true)
 |-- Browser Info: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Credit Card: long (nullable = true)
 |-- CC Exp Date: string (nullable = true)
 |-- CC Security Code: long (nullable = true)
 |-- CC Provider: string (nullable = true)
 |-- Email: string (nullable = true)
 |-- Job: string (nullable = true)
 |-- IP Address: string (nullable = true)
 |-- Language: string (nullable = true)
 |-- Purchase Price: double (nullable = true)



In [None]:
df.summary().show()

+-------+--------------------+-----+--------+--------------------+-------------+--------------------+-----------+------------------+----------------+-------------------+------------------+------------+--------+------------------+
|summary|             Address|  Lot|AM or PM|        Browser Info|      Company|         Credit Card|CC Exp Date|  CC Security Code|     CC Provider|              Email|               Job|  IP Address|Language|    Purchase Price|
+-------+--------------------+-----+--------+--------------------+-------------+--------------------+-----------+------------------+----------------+-------------------+------------------+------------+--------+------------------+
|  count|               10000|10000|   10000|               10000|        10000|               10000|      10000|             10000|           10000|              10000|             10000|       10000|   10000|             10000|
|   mean|                null| null|    null|                null|         null|

8. Show complete information of dataset.

In [None]:
# convert the Spark DataFrame to a pandas DataFrame
pandasDF = df.select("*").toPandas()

In [None]:
pandasDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Address           10000 non-null  object 
 1   Lot               10000 non-null  object 
 2   AM or PM          10000 non-null  object 
 3   Browser Info      10000 non-null  object 
 4   Company           10000 non-null  object 
 5   Credit Card       10000 non-null  int64  
 6   CC Exp Date       10000 non-null  object 
 7   CC Security Code  10000 non-null  int64  
 8   CC Provider       10000 non-null  object 
 9   Email             10000 non-null  object 
 10  Job               10000 non-null  object 
 11  IP Address        10000 non-null  object 
 12  Language          10000 non-null  object 
 13  Purchase Price    10000 non-null  float64
dtypes: float64(1), int64(2), object(11)
memory usage: 1.1+ MB


In [None]:
print("Percentage of Missing data within dataset: ", pandasDF.isnull().sum() * 100 / len(pandasDF))

Percentage of Missing data within dataset:  Address             0.0
Lot                 0.0
AM or PM            0.0
Browser Info        0.0
Company             0.0
Credit Card         0.0
CC Exp Date         0.0
CC Security Code    0.0
CC Provider         0.0
Email               0.0
Job                 0.0
IP Address          0.0
Language            0.0
Purchase Price      0.0
dtype: float64


In [None]:
print("Percentage of duplicated data/rows within dataset: ", pandasDF.duplicated().sum() * 100 / len(pandasDF))

Percentage of duplicated data/rows within dataset:  0.0


9. Show Highest and Lowest Purchase Prices.

In [None]:
#import needed aggregate functions
from pyspark.sql.functions import countDistinct, avg, stddev, max, min, ceil, round
df.select(max('Purchase Price').alias('Highest Purchase Price:')).show()

+-----------------------+
|Highest Purchase Price:|
+-----------------------+
|                  99.99|
+-----------------------+



In [None]:
df.select(min('Purchase Price').alias('Lowest Purchase Price:')).show()

+----------------------+
|Lowest Purchase Price:|
+----------------------+
|                   0.0|
+----------------------+



10. Show average purchase price

In [None]:
df.select(round(avg('Purchase Price')).alias('Average Purchase Price:')).show()

+-----------------------+
|Average Purchase Price:|
+-----------------------+
|                   50.0|
+-----------------------+



11. How many people have French 'fr' as their Language?

In [None]:
# Filter all rows whose 'Language' column contains the string 'fr'
df.filter(col("Language").contains("fr")).show(truncate=False)

+---------------------------------------------------------------+-----+--------+--------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------+----------------+-----------+----------------+---------------------------+--------------------------------+-------------------------------------+---------------+--------+--------------+
|Address                                                        |Lot  |AM or PM|Browser Info                                                                                                                                      |Company                        |Credit Card     |CC Exp Date|CC Security Code|CC Provider                |Email                           |Job                                  |IP Address     |Language|Purchase Price|
+---------------------------------------------------------------+-----+--------+------------------------------

In [None]:
print("Total number of people who have French (fr) as their language: ", df.filter(df['Language'] == 'fr').count())

Total number of people who have French (fr) as their language:  1097


In [None]:
#count how many people have a certain language selected
df.groupBy('Language').agg(count("Language").alias("Number of People")).show()

+--------+----------------+
|Language|Number of People|
+--------+----------------+
|      en|            1098|
|      pt|            1118|
|      de|            1155|
|      es|            1095|
|      el|            1137|
|      it|            1086|
|      ru|            1155|
|      zh|            1059|
|      fr|            1097|
+--------+----------------+



12. Find Job Title Contains Engineer word 

In [None]:
# Filter all rows whose 'Job' column contains the string 'engineer' (lowercase e)
df.filter(col("Job").contains("engineer")).show(truncate=False)

+----------------------------------------------------------------+-----+--------+--------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------+----------------+-----------+----------------+----------------+------------------------------+------------------------------------+---------------+--------+--------------+
|Address                                                         |Lot  |AM or PM|Browser Info                                                                                                                                      |Company                        |Credit Card     |CC Exp Date|CC Security Code|CC Provider     |Email                         |Job                                 |IP Address     |Language|Purchase Price|
+----------------------------------------------------------------+-----+--------+-------------------------------------------------------

In [None]:
# Filter all rows whose 'Job' column contains the string 'Engineer' (uppercase E)
df.filter(col("Job").contains("Engineer")).show(truncate=False)

+---------------------------------------------------------------+-----+--------+--------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------+----------------+-----------+----------------+----------------+-------------------------------+-----------------------------------+---------------+--------+--------------+
|Address                                                        |Lot  |AM or PM|Browser Info                                                                                                                                      |Company                      |Credit Card     |CC Exp Date|CC Security Code|CC Provider     |Email                          |Job                                |IP Address     |Language|Purchase Price|
+---------------------------------------------------------------+-----+--------+--------------------------------------------------------------

The Job column contains both 'Engineer' (uppercase E) and 'engineer' (lowercase e), and contains() is case-sensitive.

In [None]:
print("Total number of people who have engineer (lowercase e) in their job: ", df.filter(df['Job'].contains("engineer")).count())

Total number of people who have engineer (lowercase e) in their job:  531


In [None]:
print("Total number of people who have Engineer (uppercase E) in their job: ", df.filter(df['Job'].contains("Engineer")).count())

Total number of people who have Engineer (uppercase E) in their job:  453


Total is 984 for both upper and lowercase.
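Because the match is case-sensitive, the two spellings have to be counted separately and then added. A plain-Python sketch of a case-insensitive count is shown here on a few job titles; "Chemical engineer" is a made-up value, the others appear in the output of this notebook. In PySpark the same idea is usually expressed by lowercasing the column (for example with `pyspark.sql.functions.lower`) before calling `contains`, though that variant was not run here.

```python
# Sample job titles; "Chemical engineer" is hypothetical.
jobs = [
    "Engineer, aeronautical",
    "Chemical engineer",
    "Retail merchandiser",
    "Catering manager",
]


def count_containing(values, needle, ignore_case=False):
    """Count the values that contain needle, optionally ignoring case."""
    if ignore_case:
        needle = needle.lower()
        return sum(needle in value.lower() for value in values)
    return sum(needle in value for value in values)


lower_only = count_containing(jobs, "engineer")
upper_only = count_containing(jobs, "Engineer")
any_case = count_containing(jobs, "engineer", ignore_case=True)
print(lower_only, upper_only, any_case)
```

A single case-insensitive pass also avoids double counting a title that happens to contain both spellings.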

In [None]:
#count how many people have a certain Job selected
df.groupBy('Job').agg(count("Job").alias("Number of People with that certain Job")).show(1000, truncate=False)

+-----------------------------------------------------------+--------------------------------------+
|Job                                                        |Number of People with that certain Job|
+-----------------------------------------------------------+--------------------------------------+
|Retail merchandiser                                        |15                                    |
|Engineer, aeronautical                                     |16                                    |
|Catering manager                                           |14                                    |
|Librarian, academic                                        |19                                    |
|Diplomatic Services operational officer                    |8                                     |
|Designer, ceramics/pottery                                 |19                                    |
|Occupational hygienist                                     |17                            

13. Find The Email of the person with the following IP Address: 132.207.160.22

In [None]:
df.filter(col("IP Address").contains("132.207.160.22")).show(truncate=False)

+----------------------------------+-----+--------+---------------------------------------------------------------+--------------------------+------------+-----------+----------------+------------+------------------------------+------------------------+--------------+--------+--------------+
|Address                           |Lot  |AM or PM|Browser Info                                                   |Company                   |Credit Card |CC Exp Date|CC Security Code|CC Provider |Email                         |Job                     |IP Address    |Language|Purchase Price|
+----------------------------------+-----+--------+---------------------------------------------------------------+--------------------------+------------+-----------+----------------+------------+------------------------------+------------------------+--------------+--------+--------------+
|Unit 0065 Box 5052\r\nDPO AP 27450|94 vE|PM      |Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Trident/5.1)|Simpso

In [None]:
print("Total number of people who have this IP address: ", df.filter(df['IP Address'] == '132.207.160.22').count())

Total number of people who have this IP address:  1


In [None]:
#show email of 132.207.160.22
df.select("Email").filter(df['IP Address'] == '132.207.160.22').show(truncate=False)

+------------------------------+
|Email                         |
+------------------------------+
|amymiller@morales-harrison.com|
+------------------------------+



14. How many People have Mastercard as their Credit Card Provider and made a purchase above 50 dollars?

In [None]:
#count how many people have a certain Credit Card Provider
df.groupBy('CC Provider').agg(count("CC Provider").alias("Number of People with that Certain CC Provider")).show(1000, truncate=False)

+---------------------------+----------------------------------------------+
|CC Provider                |Number of People with that Certain CC Provider|
+---------------------------+----------------------------------------------+
|VISA 16 digit              |1715                                          |
|VISA 13 digit              |777                                           |
|Discover                   |817                                           |
|Diners Club / Carte Blanche|767                                           |
|American Express           |849                                           |
|Maestro                    |846                                           |
|Mastercard                 |816                                           |
|JCB 16 digit               |1716                                          |
|JCB 15 digit               |868                                           |
|Voyager                    |829                                           |

In [None]:
print("Total number of people who have this credit card provider: ", df.filter(df['CC Provider'].contains("Mastercard")).count())

Total number of people who have this credit card provider:  816


In [None]:
print("Number of people who have Mastercard and made purchases over 50.00: ", df.filter((df['CC Provider'] == 'Mastercard') & (df['Purchase Price'] > 50.00)).count())

Number of people who have Mastercard and made purchases over 50.00:  405


15. Find the email of the person with the following Credit Card Number: 4664825258997302

In [None]:
#show email for credit card number 4664825258997302
df.select("Email").filter(df['Credit Card'] == 4664825258997302).show(truncate=False)

+-----------------+
|Email            |
+-----------------+
|bberry@wright.net|
+-----------------+



16. How many people purchase during the AM and how many people purchase during PM?

In [None]:
print("Total number of people who made a purchase during the AM time range: ", df.filter(df['AM or PM'].contains("AM")).count())

Total number of people who made a purchase during the AM time range:  4932


In [None]:
print("Total number of people who made a purchase during the PM time range: ", df.filter(df['AM or PM'].contains("PM")).count())

Total number of people who made a purchase during the PM time range:  5068


Slightly more people made purchases during the PM time range (5068 versus 4932).

17. How many people have a credit card that expires in 2020? 

In [None]:
print("Total number of people whose credit card expires in 2021: ", df.filter(df['CC Exp Date'].contains("/21")).count())

Total number of people whose credit card expires in 2021:  1006


In [None]:
print("Total number of people whose credit card expires in 2020: ", df.filter(df['CC Exp Date'].contains("/20")).count())

Total number of people whose credit card expires in 2020:  988


In [None]:
print("Total number of people whose credit card expires in 2019: ", df.filter(df['CC Exp Date'].contains("/19")).count())

Total number of people whose credit card expires in 2019:  995


In [None]:
print("Total number of people whose credit card expires in 2018: ", df.filter(df['CC Exp Date'].contains("/18")).count())

Total number of people whose credit card expires in 2018:  995


In [None]:
print("Total number of people whose credit card expires in 2017: ", df.filter(df['CC Exp Date'].contains("/17")).count())

Total number of people whose credit card expires in 2017:  955


In [None]:
print("Total number of people whose credit card expires in 2016: ", df.filter(df['CC Exp Date'].contains("/16")).count())

Total number of people whose credit card expires in 2016:  376


In [None]:
print("Total number of people whose credit card expires in 2015: ", df.filter(df['CC Exp Date'].contains("/15")).count())

Total number of people whose credit card expires in 2015:  0
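The per-year cells above repeat the same filter with a different suffix. A plain-Python sketch of counting every expiry year in one pass is shown here. It assumes the CC Exp Date values are in MM/YY form with years in the 2000s, as in the rows displayed earlier; the sample list is partly made up ("02/20" and "02/19" appear in the notebook output, the rest are hypothetical).

```python
from collections import Counter

# Sample expiry dates in MM/YY form; only the first two come from the dataset.
exp_dates = ["02/20", "02/19", "11/21", "07/20", "01/16"]


def expiry_year_counts(dates):
    """Count dates per four-digit expiry year, assuming MM/YY in the 2000s."""
    return Counter(2000 + int(date.split("/")[1]) for date in dates)


year_counts = expiry_year_counts(exp_dates)
for year in sorted(year_counts):
    print(year, year_counts[year])
```

A Counter returns 0 for a year that never occurs, so looking up a year outside the data does not raise an error.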


18. What are the top 5 most popular email providers (e.g. gmail.com, yahoo.com, etc.)?
a. Create a user-defined function, or use any other approach, to get the result.


In [None]:
# import the regex library and Counter
import re as regex
from collections import Counter

def get_domains(columnName):
  # select the column as a pandas Series
  input_data = pandasDF[columnName]

  # pattern for the domain part of an email address (may vary)
  pattern = r'@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}'

  # find all matching occurrences
  result = [regex.findall(pattern, email) for email in input_data]
  # flatten the list of lists
  flat_list = [x for xs in result for x in xs]
  # count the number of occurrences of each email domain
  dictionary = Counter(flat_list)

  # sort the counts in descending order
  sorted_dictionary = dict(sorted(dictionary.items(), key=lambda y: y[1], reverse=True))
  # print the results
  for key, value in sorted_dictionary.items():
    print(key, ": ", value)

get_domains("Email")

@hotmail.com :  1638
@yahoo.com :  1616
@gmail.com :  1605
@smith.com :  42
@williams.com :  37
@brown.com :  29
@johnson.com :  29
@davis.com :  25
@jones.com :  25
@martinez.com :  19
@white.com :  15
@martin.com :  15
@garcia.com :  14
@jackson.com :  14
@rodriguez.com :  14
@thompson.com :  14
@miller.com :  14
@thomas.com :  14
@hill.com :  13
@harris.com :  12
@taylor.com :  12
@wilson.com :  11
@moore.com :  11
@walker.com :  11
@nelson.com :  10
@roberts.com :  10
@lee.com :  10
@long.com :  10
@anderson.com :  10
@king.com :  10
@hernandez.com :  10
@hall.com :  9
@wright.com :  9
@smith.org :  9
@sanchez.com :  9
@edwards.com :  9
@cook.com :  9
@williams.biz :  9
@lewis.com :  9
@turner.com :  9
@young.com :  9
@reed.com :  8
@allen.com :  8
@johnson.info :  8
@fisher.com :  8
@richardson.com :  8
@gonzalez.com :  8
@gonzales.com :  8
@smith.info :  7
@jones.biz :  7
@green.com :  7
@smith.biz :  7
@lopez.com :  7
@ross.com :  7
@jones.net :  7
@perez.com :  7
@stewart.com :
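The question asks only for the top 5 providers, while the function above prints every domain. A minimal sketch that stops at the top n uses `Counter.most_common`. The helper name `top_domains` is hypothetical, and apart from pdunlap@yahoo.com the sample addresses are made up.

```python
from collections import Counter

# Sample addresses; only the first one appears in the dataset output.
emails = [
    "pdunlap@yahoo.com",
    "a@yahoo.com",
    "b@yahoo.com",
    "c@gmail.com",
    "d@gmail.com",
    "e@hotmail.com",
]


def top_domains(addresses, n=5):
    """Return the n most frequent domains as (domain, count) pairs."""
    domains = (
        address.rsplit("@", 1)[1].lower()
        for address in addresses
        if "@" in address
    )
    return Counter(domains).most_common(n)


top_two = top_domains(emails, n=2)
print(top_two)
```

With the notebook's data the input would be the Email column of pandasDF, for example `top_domains(pandasDF["Email"], n=5)`.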

Additional Analysis

Browser Stats For purchases



Example value from the Browser Info column:

Opera/8.93.(Windows 98; Win 9x 4.90; en-US) Presto/2.9.176 Version/11.00

Every Browser Info value starts with either Mozilla or Opera.

In [None]:
#split up browser info column by '/'
pandasDF[['browser_extra', 'browser_extra_2']] = pandasDF['Browser Info'].str.split('/', n=1, expand=True)

#count how many people have a certain browser type
pandasDF.groupby('browser_extra').size()


browser_extra
Mozilla    7924
Opera      2076
dtype: int64

In [None]:
print("Percentage of Mozilla Users: %", pandasDF.groupby('browser_extra').size()['Mozilla'] * 100 / len(pandasDF))

Percentage of Mozilla Users: % 79.24


In [None]:
print("Percentage of Opera Users: %", pandasDF.groupby('browser_extra').size()['Opera'] * 100 / len(pandasDF))

Percentage of Opera Users: % 20.76


Computer Usage stats for purchases

In [None]:
#split up browser info column by '('
pandasDF[['browser_extra', 'browser_extra_3']] = pandasDF['Browser Info'].str.split('(', n=1, expand=True)

#count how many people have a certain computer type
pandasDF.groupby('browser_extra_3').size()


browser_extra_3
Macintosh; Intel Mac OS X 10_5_0 rv:2.0; sl-SI) AppleWebKit/533.25.6 (KHTML, like Gecko) Version/4.0 Safari/533.25.6                     1
Macintosh; Intel Mac OS X 10_5_0 rv:4.0; en-US) AppleWebKit/535.34.7 (KHTML, like Gecko) Version/5.0 Safari/535.34.7                     1
Macintosh; Intel Mac OS X 10_5_0 rv:5.0; sl-SI) AppleWebKit/532.10.4 (KHTML, like Gecko) Version/5.0.2 Safari/532.10.4                   1
Macintosh; Intel Mac OS X 10_5_0 rv:6.0; en-US) AppleWebKit/533.45.4 (KHTML, like Gecko) Version/4.1 Safari/533.45.4                     1
Macintosh; Intel Mac OS X 10_5_0) AppleWebKit/5321 (KHTML, like Gecko) Chrome/14.0.824.0 Safari/5321                                     1
                                                                                                                                        ..
iPod; U; CPU iPhone OS 4_3 like Mac OS X; sl-SI) AppleWebKit/535.3.2 (KHTML, like Gecko) Version/4.0.5 Mobile/8B114 Safari/6535.3.2      1
iPod; U; CP

In [None]:
pandasDF[['browser_extra5', 'browser_extra_4']] = pandasDF['browser_extra_3'].str.split(';', n=1, expand=True)
pandasDF.groupby('browser_extra5').size()

browser_extra5
Macintosh                                                                             2061
Windows                                                                                630
Windows 95                                                                             147
Windows 95) AppleWebKit/5310 (KHTML, like Gecko) Chrome/14.0.860.0 Safari/5310           1
Windows 95) AppleWebKit/5311 (KHTML, like Gecko) Chrome/13.0.846.0 Safari/5311           1
                                                                                      ... 
Windows NT 6.2) AppleWebKit/5362 (KHTML, like Gecko) Chrome/15.0.809.0 Safari/5362       1
Windows NT 6.2) AppleWebKit/5362 (KHTML, like Gecko) Chrome/15.0.888.0 Safari/5362       1
X11                                                                                   2322
compatible                                                                            2052
iPod                                                                       

In [None]:
#break down count of devices further
pandasDF[['browser_extra_6', 'browser_extra_7']] = pandasDF['browser_extra5'].str.split(')', n=1, expand=True)
#count frequency of devices used to make purchases in dataset
pandasDF.groupby('browser_extra_6').size()

browser_extra_6
Macintosh          2061
Windows             630
Windows 95          202
Windows 98          390
Windows CE          184
Windows NT 4.0      183
Windows NT 5.0      181
Windows NT 5.01     209
Windows NT 5.1      209
Windows NT 5.2      199
Windows NT 6.0      179
Windows NT 6.1      210
Windows NT 6.2      166
X11                2322
compatible         2052
iPod                623
dtype: int64

In [None]:
#break down count of devices further
pandasDF[['browser_extra_8', 'browser_extra_9']] = pandasDF['browser_extra_6'].str.split(' NT', n=1, expand=True)
#count frequency of devices used to make purchases in dataset
print(pandasDF.groupby('browser_extra_8').size())
print("Total Number of Devices used: ", pandasDF.groupby('browser_extra_8').size().sum())

browser_extra_8
Macintosh     2061
Windows       2166
Windows 95     202
Windows 98     390
Windows CE     184
X11           2322
compatible    2052
iPod           623
dtype: int64
Total Number of Devices used:  10000
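The chain of splits above takes the text after the first "(", cuts it at the first ";" or ")", and then drops anything from " NT" onward. A plain-Python sketch of the same extraction in one step is shown here with a regular expression. The names `PLATFORM_RE` and `platform` are hypothetical, and the "Mozilla/5.0 (" prefix on the last two sample strings is an assumption, since the notebook output only shows the part after the parenthesis.

```python
import re
from collections import Counter

# First token after the first "(", up to the next ";" or ")".
PLATFORM_RE = re.compile(r"\(([^;)]+)")


def platform(user_agent):
    """Return the leading platform token of a user-agent string, or None."""
    match = PLATFORM_RE.search(user_agent)
    if match is None:
        return None
    token = match.group(1).strip()
    # Fold "Windows NT 6.2" and similar into "Windows", like the ' NT' split.
    return token.split(" NT", 1)[0]


# Sample strings; the "Mozilla/5.0 (" prefix on the last two is assumed.
samples = [
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Trident/5.1)",
    "Opera/8.93.(Windows 98; Win 9x 4.90; en-US) Presto/2.9.176 Version/11.00",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/5362 (KHTML, like Gecko) "
    "Chrome/15.0.809.0 Safari/5362",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_5_0) AppleWebKit/5321 "
    "(KHTML, like Gecko) Chrome/14.0.824.0 Safari/5321",
]

platform_counts = Counter(platform(ua) for ua in samples)
print(platform_counts)
```

Doing the extraction in one function avoids the intermediate browser_extra columns, at the cost of a pattern that has to be kept in step with the format of the strings.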


In [None]:
# import the regex library and Counter
import re as regex
from collections import Counter

def get_frequency(columnName, string_search_pattern_input):
  # select the column as a pandas Series
  input_data = pandasDF[columnName]

  # search pattern supplied by the caller
  pattern = string_search_pattern_input

  # find all matching occurrences in every value
  result = [regex.findall(pattern, value) for value in input_data]
  # flatten the list of lists
  flat_list = [x for xs in result for x in xs]
  # count the number of occurrences of each match
  dictionary = Counter(flat_list)

  # sort the counts in descending order
  sorted_dictionary = dict(sorted(dictionary.items(), key=lambda y: y[1], reverse=True))
  # print the results
  for key, value in sorted_dictionary.items():
    print(key, ": ", value)

get_frequency("Browser Info", 'Mac OS X')

Mac OS X :  2684


Pattern                        Count
-----------------------------  -----
Windows                         5624
Mac                             4745
Windows NT                      3325
Mac OS X                        2684
X11                             2322
Macintosh                       2061
compatible                      2052
Windows 98                       809
iPod                             623
Windows 95                       445
Windows CE (Embedded System)     415

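These counts differ from the split-based counts because the regex search counts every occurrence of a pattern anywhere in the string, and a shorter pattern also matches inside longer ones. For example, "Mac" (4745) matches both "Macintosh" (2061) and "Mac OS X" (2684), and "Windows" (5624) also matches "Windows NT", "Windows 98", "Windows 95", and "Windows CE".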
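A short plain-Python sketch makes the overlap between substring counts concrete. The helper `count_occurrences` is hypothetical, and the "Mozilla/5.0 (" prefix on the first two sample strings is an assumption, since the notebook output shows only the part after the parenthesis.

```python
import re

# Sample strings; the "Mozilla/5.0 (" prefix on the first two is assumed.
user_agents = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_5_0) AppleWebKit/5321 "
    "(KHTML, like Gecko) Chrome/14.0.824.0 Safari/5321",
    "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3 like Mac OS X; sl-SI) "
    "AppleWebKit/535.3.2 (KHTML, like Gecko) Version/4.0.5 Mobile/8B114 "
    "Safari/6535.3.2",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.2; Trident/5.1)",
]


def count_occurrences(pattern, strings):
    """Count literal, case-sensitive occurrences of pattern in all strings."""
    literal = re.escape(pattern)
    return sum(len(re.findall(literal, text)) for text in strings)


mac = count_occurrences("Mac", user_agents)
macintosh = count_occurrences("Macintosh", user_agents)
mac_os_x = count_occurrences("Mac OS X", user_agents)
print(mac, macintosh, mac_os_x)
```

In this sample every "Mac" hit is either a "Macintosh" hit or a "Mac OS X" hit, so the first count is the sum of the other two, which mirrors 4745 = 2061 + 2684 in the table above.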