<?xml version="1.0" encoding="UTF-8"?>
<commit>
  <added type="array"/>
  <modified type="array">
    <modified>
      <diff>@@ -1,25 +1,32 @@
 # Settings for the MailmanArchiveScraper script.
-# This file should be in the same directory as the script and named 'MailmanArchiveScraper.cfg'.
+#
+# Copy the MailmanArchiveScraper-example.cfg, rename it MailmanArchiveScraper.cfg
+# and put it in the same directory as the script itself.
 #
 # The minimum settings you need to set are:
 # 1. domain -- The domain name that your Mailman pages are on.
 # 2. list_name -- Name of your mailing list.
 # 3. email and password -- Required if your Mailman archive is password protected.
 # 4. publish_dir -- The path to the local directory the files should be republished to.
+# 5. publish_url -- If you're publishing the messages to a website.
+#
+# If you want an RSS feed to be created you'll also need to fill in the settings in that section. 
 
 
+###############################################################################
 [Mailman]
 # Settings associated with the remote server.
 
+
 # The domain name that your Mailman pages are on.
-# eg, in http://lists.mydomain.com/mailman/listinfo/list-name
-# it's the 'lists.mydomain.com' part.
+# eg, in http://lists.example.com/mailman/listinfo/list-name
+# it's the 'lists.example.com' part.
 domain =  
 
 
 # Name of your mailing list.
 # This can be found in the URL of the list info page.
-# eg, in http://lists.mydomain.com/mailman/listinfo/list-name
+# eg, in http://lists.example.com/mailman/listinfo/list-name
 # it's the 'list-name' part.
 list_name = 
 
@@ -30,18 +37,19 @@ email =
 password =  
 
 
-
+###############################################################################
 [Conversion]
 # Settings about how to filter the pages before saving copies.
 
+
 # Remove email addresses or not? 
 # Either 1 or 0 (for yes or no).
 filter_email_addresses = 0
 
 
 # We'll replace all occurrences of the main list info url 
-# eg, http://lists.mydomain.com/mailman/listinfo/list-name 
-# or, http://lists.mydomain.com/pipermail/list-name
+# eg, http://lists.example.com/mailman/listinfo/list-name 
+# or, http://lists.example.com/pipermail/list-name
 # with this string. If you don't want it changed, put the main list info url here. 
 list_info_url = 
 
@@ -64,20 +72,54 @@ strip_quotes = 0
 #	Ancient//Modern
 search_replace = 
 
+
 # The path to a file containing HTML. The contents will be inserted in the &lt;head&gt; of every page.
 # eg, Use this to insert links to stylesheets, Google Analytics javascript, etc.
 head_html = 
 
 
+
+###############################################################################
+[RSS]
+# Settings related to the generated RSS file.
+# You can leave all these blank if you don't want an RSS feed generated.
+
+
+# The local path where the RSS file should be generated.
+# eg /Users/phil/Sites/lists/html/list-name/rss.xml
+rss_file = 
+
+
+# The number of recent messages to be included in the RSS file. 
+items_for_rss = 7
+
+
+# The title of the RSS feed.
+rss_title = 
+
+
+# The description of the RSS feed (usually one sentence).
+rss_description = 
+
+
+
+###############################################################################
 [Local]
 # Settings about the local archive.
 
+
 # The absolute path to a local directory in which all the lists' HTML files will be stored.
 # If it doesn't exist the script will try to create it.
-# eg /Users/phil/Sites/lists/html/list-name/
+# eg /Users/phil/Sites/examplesite/html/list-name/
 publish_dir = 
 
 
+# The full web address to the directory referred to by publish_dir.
+# eg http://www.example.com/list-name/ (include the trailing slash)
+# If you're not publishing these on the web, leave this blank.
+publish_url = 
+
+
 # How many hours back should we look back for new messages when the script runs?
 # Set it to more than however often the script is run via cron.
 # eg, if you run the script every hour, set this to maybe 4ish, to allow for times the script might fail.</diff>
      <filename>MailmanArchiveScraper-example.cfg</filename>
    </modified>
    <modified>
      <diff>@@ -1,13 +1,13 @@
 &quot;&quot;&quot;
-* Scrapes the archive pages of one or more lists in a Mailman installation and republishes the contents.
-* v1.0, 2009-04-05
+* Scrapes the archive pages of one or more lists in a Mailman installation and republishes the contents, with an optional RSS feed.
+* v1.1, 2009-05-04
 * http://github.com/philgyford/mailman-archive-scraper/
 * 
 * Only works with Monthly archives at the moment.
 * Could do with more error checking, especially around loadConfig().
 * Hasn't had a huge amount of testing -- use with care.
 &quot;&quot;&quot;
-import ClientForm, ConfigParser, email.utils, mechanize, os, re, sys, time, urlparse
+import ClientForm, ConfigParser, datetime, email.utils, mechanize, os, PyRSS2Gen, re, sys, time, urlparse
 from BeautifulSoup import BeautifulSoup
 
 
@@ -16,7 +16,6 @@ class MailmanArchiveScraper:
     Scrapes the archive pages of one or more lists in a Mailman installation and republishes the contents.
     &quot;&quot;&quot;
     
-
     def __init__(self):
         self.loadConfig()
 
@@ -36,8 +35,13 @@ class MailmanArchiveScraper:
         if not os.path.exists(self.publish_dir):
             os.mkdir(self.publish_dir)
             
+        # We'll keep track of how many items (emails) we fetch with this.
+        self.messages_fetched = 0
+        
+        self.prepareRSS()
+        
         self.prepareRegExps()
-            
+        
 
     def loadConfig(self):
         &quot;Loads configuration from the MailmanArchiveScraper.cfg file&quot;
@@ -79,7 +83,14 @@ class MailmanArchiveScraper:
                     self.error(&quot;'&quot;+sr+&quot;' is not a valid search_replace string.&quot;)
                 self.search_replace[search] = replace
         
+        
+        self.rss_file = config.get('RSS', 'rss_file')
+        self.items_for_rss = int(config.get('RSS', 'items_for_rss'))
+        self.rss_title = config.get('RSS', 'rss_title')
+        self.rss_description = config.get('RSS', 'rss_description')
+        
         self.publish_dir = config.get('Local', 'publish_dir')
+        self.publish_url = config.get('Local', 'publish_url')
         self.hours_to_go_back = int(config.get('Local', 'hours_to_go_back'))
         self.verbose = config.getboolean('Local', 'verbose')
 
@@ -92,44 +103,115 @@ class MailmanArchiveScraper:
         Although the regexps are set here, they might not be used in filterPages(),
         depending on the config settings.
         &quot;&quot;&quot;
-        
+
         # Remove all standard emails, eg &quot;billy@nomates.com&quot; or &quot;&lt;billy@nomates.com&gt;&quot;
         self.match_email = re.compile(r'\b&lt;?[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}&gt;?\b', re.IGNORECASE)
-        
+
         # Remove all email addresses obscured by Mailman, eg &quot;billy at nomates.com&quot;
         self.match_text_email = re.compile(r'\b[A-Z0-9._%+-]+\sat\s[A-Z0-9.-]+\.[A-Z]{2,4}\b', re.IGNORECASE)
-        
+
         # Remove all mailto: links. Replaces them with '#'
         self.match_mailto = re.compile(r'mailto\:[^&quot;]*', re.IGNORECASE)
-        
+
         # Replace any remaining links to the original list pages with #
         # A bit messy, but just in case.
         # eg, for links to message attachments.
         self.match_list_url = re.compile(r''+self.list_url, re.IGNORECASE)
-        
+
         # Replace the list info url with our custom one from the config
         self.match_list_info_url = re.compile(r'http://' + self.domain + '/mailman/listinfo/' + self.list_name, re.IGNORECASE)
-        
+
         # Matches lines that begin with &lt;/I&gt;&amp;gt;&lt;i&gt;
         # With the number of '&amp;gt;' depending on the level of self.strip_quotes.
         if self.strip_quotes &gt; 0:
             min_level_to_strip = self.strip_quotes + 1
             self.match_strip_quotes = re.compile(r'&lt;/I&gt;(&amp;gt;){' + str(min_level_to_strip) + ',}&lt;i&gt;\s.*?\n', re.IGNORECASE)
-        
+
         # Prepare a dictionary of regexp =&gt; replacement for each of the search_replace terms.
         self.match_search_replace = {}
         for search, replace in self.search_replace.iteritems():
             regexp = re.compile(r''+search, re.IGNORECASE)
             self.match_search_replace[regexp] = replace
-        
+
+        # For inserting custom HTML just before the end of the &lt;head&gt;&lt;/head&gt; section.
         self.match_head_html = re.compile(r'&lt;/head&gt;', re.IGNORECASE)
-        
+
+        # For removing anything before the subject of the message.
+        # Probably something like &quot;[List name]  Subject of the message&quot;.
+        self.match_subject = re.compile(r'^(?:\[.*?\]\s+)?', re.IGNORECASE)
+
 
     def scrape(self):
         if not self.public_list:
             self.logIn()
-            
+
         self.scrapeList()
+
+        self.publishRSS()
+        
+    
+    def prepareRSS(self):
+        &quot;&quot;&quot;Prepare things for the RSS feed.&quot;&quot;&quot;
+        
+        if self.rss_file == '':
+            # We're not generating an RSS feed.
+            return
+            
+        self.rss = PyRSS2Gen.RSS2(
+            title = self.rss_title,
+            link = self.list_info_url,
+            description = self.rss_description,
+            lastBuildDate = datetime.datetime.now()
+        )
+        
+        # Items will be added in self.scrapeMessage().
+        self.rss_items = []
+    
+    
+    def addRSSItem(self, message_url, message_time, soup):
+        &quot;&quot;&quot;
+        Add an item to the RSS feed.
+        message_url - The local, newly-published, URL to this item on the web.
+        message_time - The timestamp of when this email was sent.
+        soup - A BeautifulSoup object of the HTML page the message was originally on.
+        &quot;&quot;&quot;
+
+        if self.rss_file == '':
+            # We're not generating an RSS feed.
+            return
+
+        # Get the subject of the message
+        subject = soup.h1.string
+        # Remove any preliminary &quot;[List name] &quot; stuff.
+        subject = self.match_subject.sub(r'', subject)
+
+        # Body of the message including HTML tags.
+        # (Not used at the moment.)
+        #body_html = str(soup.pre)
+
+        # Body of the message (everything within &lt;pre&gt;&lt;/pre&gt; tags) with all HTML tags stripped.
+        body_text = ''.join(soup.pre.findAll(text=True))
+
+        # Add this message to the RSS feed.
+        self.rss_items.append(
+         PyRSS2Gen.RSSItem(
+             title = subject,
+             link = message_url,
+             description = self.smartTruncate(body_text, 500),
+             pubDate = datetime.datetime.fromtimestamp(message_time)
+         )
+        )
+    
+    
+    def publishRSS(self):
+        &quot;&quot;&quot;Publish the accumulated RSS items.&quot;&quot;&quot;
+
+        if self.rss_file == '':
+            # We're not generating an RSS feed.
+            return
+
+        self.rss.items = self.rss_items
+        self.rss.write_xml(open(self.rss_file, &quot;w&quot;), 'utf-8')
         
         
     def logIn(self):
@@ -170,7 +252,7 @@ class MailmanArchiveScraper:
         filtered_source = self.filterPage(source)
 
         # Save our local copy.
-        # eg /Users/phil/Sites/lists/html/list-name/index.html
+        # eg /Users/phil/Sites/examplesite/html/list-name/index.html
         local_index = open(self.publish_dir + '/index.html', 'w')
         local_index.write(filtered_source)
         local_index.close()
@@ -199,11 +281,11 @@ class MailmanArchiveScraper:
         date is a string of the form '2009-February'
         &quot;&quot;&quot;
         
-        # eg http://www.mydomain.com/mailman/private/list-name/2009-February
+        # eg http://lists.example.com/mailman/private/list-name/2009-February
         month_url = self.list_url + '/' + date
         
         # Get the directory the month files will be saved in.
-        # eg /Users/phil/Sites/lists/html/list-name/2009-February
+        # eg /Users/phil/Sites/examplesite/html/list-name/2009-February
         url_parts = month_url.split('/')
         month_dir = self.publish_dir + '/' + url_parts[-1]
         if not os.path.exists(month_dir):
@@ -220,16 +302,17 @@ class MailmanArchiveScraper:
         
         # Get all the links to individual message pages.
         keep_fetching = True
-        messages_scraped = 0
+        messages_fetched_this_month = 0
         for a in anchors:
             link = a.get('href', '')
             if link:
                 # Fetch this message's page and save it.
                 # hours will be how many hours ago this message was sent.
                 hours = self.scrapeMessage(urlparse.urljoin(month_url+'/', link))
-                messages_scraped += 1
-                
-                if self.hours_to_go_back &gt; 0 and hours &gt; self.hours_to_go_back:
+                messages_fetched_this_month += 1  # Count just for this month.
+                self.messages_fetched += 1  # Overall count.
+
+                if self.hours_to_go_back &gt; 0 and hours &gt; self.hours_to_go_back and self.messages_fetched &gt;= self.items_for_rss:
                     # We'll send a signal back to scrapeList() that we don't want to get any previous months.
                     keep_fetching = False
                     break
@@ -239,7 +322,7 @@ class MailmanArchiveScraper:
                     
         # Fetch all the non-date index files for this month and save copies.
         # There's been at least one new message, so get new copies of the other index pages.
-        if (messages_scraped == 1 and keep_fetching) or (messages_scraped &gt; 1):
+        if (messages_fetched_this_month == 1 and keep_fetching) or (messages_fetched_this_month &gt; 1):
             for file in ['thread', 'subject', 'author']:
                 source = self.fetchMonthFile(month_url, month_dir, file+'.html')
             
@@ -252,8 +335,8 @@ class MailmanArchiveScraper:
     def fetchMonthFile(self, remote_dir, local_dir, file_name):
         &quot;&quot;&quot;
         Fetches one of the monthly index pages (date.html, author.html, subject.html, thread.html).
-        remote_dir is like http://www.mydomain.com/mailman/private/list-name/2009-February
-        local_dir is like /Users/phil/Sites/lists/html/list-name/2009-February
+        remote_dir is like http://lists.example.com/mailman/private/list-name/2009-February
+        local_dir is like /Users/phil/Sites/examplesite/html/list-name/2009-February
         file_name is like date.html
         &quot;&quot;&quot;
         
@@ -263,7 +346,7 @@ class MailmanArchiveScraper:
         filtered_source = self.filterPage(source)
 
         # Save our local copy.
-        # eg /Users/phil/Sites/lists/html/list-name/2009-February/date.html
+        # eg /Users/phil/Sites/examplesite/html/list-name/2009-February/date.html
         local_month = open(local_dir + '/' + file_name, 'w')
         local_month.write(filtered_source)
         local_month.close()
@@ -272,10 +355,10 @@ class MailmanArchiveScraper:
         return source
         
         
-                    
     def scrapeMessage(self, message_url):
         &quot;&quot;&quot;
-        Fetches the page for a single message and saves it locally. 
+        Fetches the page for a single message and saves it locally.
+        Adds the message to the RSS feed items.
         Returns the number of hours old this message is.
         &quot;&quot;&quot;
         
@@ -288,23 +371,28 @@ class MailmanArchiveScraper:
         hours_ago = (time.time() - message_time) / 3600
 
         # Remove all the stuff we don't want.
-        source = self.filterPage(source)
-        
-  
+        source = self.filterPage(source)        
           
         # Get the directory the message file is in.
         # It should already have been created in scrapeMonth()
-        # eg http://www.mydomain.com/mailman/private/list-name/2009-February/000042.html
+        # eg http://lists.example.com/mailman/private/list-name/2009-February/000042.html
         url_parts = message_url.split('/')
-        # eg /Users/phil/Sites/lists/html/list-name/2009-February
-        message_dir = self.publish_dir + '/' + url_parts[-2]
+        # eg /Users/phil/Sites/examplesite/html/list-name/2009-February
+        message_dir = os.path.join(self.publish_dir, url_parts[-2])
         
         # Save our local copy.
-        # eg /Users/phil/Sites/lists/html/list-name/2009-February/000042.html
+        # eg /Users/phil/Sites/examplesite/html/list-name/2009-February/000042.html
         local_message = open(message_dir + '/' + url_parts[-1], 'w')
         local_message.write(source)
         local_message.close()
         
+        # Create the URL for linking to this message from the RSS feed.
+        # eg http://www.example.com/list-name/2009-February/000042.html
+        local_message_url = self.publish_url + url_parts[-2] + '/' + url_parts[-1]
+
+        # Add this message to the RSS feed items...        
+        self.addRSSItem(local_message_url, message_time, soup)
+        
         return hours_ago
         
         
@@ -354,6 +442,14 @@ class MailmanArchiveScraper:
         fp.close()
         
         return source
+
+
+    def smartTruncate(self, content, length=100, suffix='...'):
+        &quot;Truncates a string at a word boundary.&quot;
+        if len(content) &lt;= length:
+            return content
+        else:
+            return content[:length].rsplit(' ',1)[0] + suffix
         
         
     def message(self, text):</diff>
      <filename>MailmanArchiveScraper.py</filename>
    </modified>
    <modified>
      <diff>@@ -1,12 +1,13 @@
 # Mailman Archive Scraper
 
 By Phil Gyford &lt;phil@gyford.com&gt;  
-v1.0, 2009-04-05
+v1.1, 2009-05-04
 
 Latest version is available from &lt;http://github.com/philgyford/mailman-archive-scraper/&gt;
 
 This script will scrape the archive pages generated by the Mailman mailing list manager &lt;http://www.gnu.org/software/mailman/index.html&gt; and republish them as files on the local file system. In addition it can optionally do a number of things:
 
+* Create an RSS feed linking to recent messages.
 * Scrape private Mailman archives (if you have a valid email address and password).
 * Remove all email addresses from the files (both those in 'phil@gyford.com' and 'phil at gyford dot com' format).
 * Replace the URL for the 'more info on this list' links with another.
@@ -14,14 +15,18 @@ This script will scrape the archive pages generated by the Mailman mailing list
 * Search and replace any custom strings you specify.
 * Add custom HTML into the &lt;head&gt;&lt;/head&gt; section of the re-published pages.
 
-Why would you want to do this? A couple of reasons:
+Why would you want to do this? Three reasons:
 
 1. You want to create your own HTML archive of a mailing list hosted elsewhere.
 
 2. You want to create a public version of a private archive. We hope you have permission to do this of course. The tools mentioned above allow you to do things like anonymise names and phone numbers, remove email addresses, etc.
 
+3. You want an RSS feed of recent messages.
+
 There may be more efficient ways to do this if you have access to the database in which the Mailman archive is stored. If you don't, and can only access the web pages, this script is for you.
 
+This script doesn't store any state locally between sessions so every time it's run it will have to scrape several pages, even if nothing's changed (particularly if you want an RSS feed of n recent messages). There is a half second delay between each fetch of a remote page, which slows things up but will hopefully prevent hammering web servers.
+
 **There are caveats.** I have only tested this with a couple of Mailman archives (one private, one public) and it seems to work fine. I'm sure that some people will find problems with different installations -- unscrapeable HTML, different URLs and filepaths, etc. Feel free to suggest fixes.
 
 
@@ -34,6 +39,7 @@ There may be more efficient ways to do this if you have access to the database i
 	* BeautifulSoup &lt;http://www.crummy.com/software/BeautifulSoup/&gt;
 	* ClientForm &lt;http://wwwsearch.sourceforge.net/ClientForm/&gt;
 	* Mechanize &lt;http://wwwsearch.sourceforge.net/mechanize/&gt;
+	* PyRSS2Gen &lt;http://www.dalkescientific.com/Python/PyRSS2Gen.html&gt;
 5. Make sure the MailmanArchiveScraper.py script is executable (chmod +x).
 
 
@@ -45,8 +51,9 @@ There is help in the configuration file for each setting. The minimum things you
 2. list_name -- Name of your mailing list.
 3. email and password -- Required if your Mailman archive is password protected.
 4. publish_dir -- The path to the local directory the files should be republished to.
+5. publish_url -- If you're going to publish the messages to a website.
 
 
-## Planned additions
+## What would also be nice:
 
-* An RSS feed of recent emails.
+* Full text of messages in the RSS feed. I couldn't work out how to easily extend PyRSS2Gen to add content:encoded elements to each item. Suggestions for this are very welcome.</diff>
      <filename>README.markdown</filename>
    </modified>
  </modified>
  <removed type="array"/>
  <parents type="array">
    <parent>
      <id>cb0f375fbb9d76e24c9d5c2a4b2217f286141e27</id>
    </parent>
  </parents>
  <author>
    <name>Phil Gyford</name>
    <email>phil@gyford.com</email>
  </author>
  <url>http://github.com/philgyford/mailman-archive-scraper/commit/9b16ef0df46d631becd52e03866b8c50f2f60303</url>
  <id>9b16ef0df46d631becd52e03866b8c50f2f60303</id>
  <committed-date>2009-05-04T06:51:01-07:00</committed-date>
  <authored-date>2009-05-04T06:51:01-07:00</authored-date>
  <message>Added the ability to generate RSS files using PyRSS2Gen.

Moved to v1.1.</message>
  <tree>20246b7330fc0941f17b7f60d87aefb0e65ce34a</tree>
  <committer>
    <name>Phil Gyford</name>
    <email>phil@gyford.com</email>
  </committer>
</commit>
